Design and Implementation of High Performance Computing Cluster for Educational Purpose

Dissertation submitted in partial fulfillment of the requirements for the degree of Master of Technology, Engineering

by SURAJ CHAVAN Roll No: 121022015

Under the guidance of PROF. S. U. GHUMBRE

Department of Computer Engineering and Information Technology College of Engineering, Pune Pune - 411005.

June 2012

Dedicated to
My Mother
Smt. Kanta Chavan

DEPARTMENT OF COMPUTER ENGINEERING AND INFORMATION TECHNOLOGY, COLLEGE OF ENGINEERING, PUNE

CERTIFICATE

This is to certify that the dissertation titled Design and Implementation of High Performance Computing Cluster for Educational Purpose

has been successfully completed

By

SURAJ CHAVAN (121022015)

and is approved for the degree of

Master of Technology, Computer Engineering.

PROF. S. U. GHUMBRE
Guide,
Department of Computer Engineering
and Information Technology,
College of Engineering, Pune,
Shivaji Nagar, Pune-411005.

DR. JIBI ABRAHAM
Head,
Department of Computer Engineering
and Information Technology,
College of Engineering, Pune,
Shivaji Nagar, Pune-411005.

Date :

Abstract

This project work confronts the issue of bringing high performance computing (HPC) education to those who do not have access to a dedicated clustering environment, in an easy, fully functional and inexpensive manner through the use of ordinary PCs, Fast Ethernet, and free and open source software such as Linux, MPICH, Torque and Maui. Many undergraduate institutions in India do not have the facilities, time, or money to purchase hardware, maintain user accounts, configure software components, and keep ahead of the latest security advisories for a dedicated clustering environment. The project's primary goal is to provide an instantaneous, distributed computing environment. A consequence of providing such an environment is the ability to promote the education of high performance computing issues at the undergraduate level, by turning ordinary off-the-shelf networked computers into a non-invasive, fully functional cluster. The cluster is used to solve problems which require a high degree of computation, such as the satisfiability problem for Boolean circuits, the Radix-2 FFT algorithm, the one-dimensional time-dependent heat equation and others. The cluster is also benchmarked using High Performance Linpack and the HPCC benchmark suite. This cluster can be used for research on data mining applications with large data sets, object-oriented parallel languages, recursive matrix algorithms, network protocol optimization, graphical rendering, Fast Fourier transforms, building the college's private cloud, and more. Using this cluster, students and faculty will receive extensive experience in configuration, troubleshooting, utilization, debugging and administration issues uniquely associated with parallel computing on such a cluster. Several students and faculty can use it for their project and research work in the near future.

Acknowledgments

It is a great pleasure for me to acknowledge the assistance and contribution of a number of individuals who helped me in my project titled Design and Implementation of HPCC for Educational Purpose. First and foremost I would like to express my deepest gratitude to my guide, Prof. S. U. Ghumbre, who has encouraged, supported and guided me during every step of the project. Without his invaluable advice, completion of this project would not have been possible. I take this opportunity to thank our Head of Department, Prof. Dr. Jibi Abraham, for her able guidance and for providing all the necessary facilities, which were indispensable in the completion of this project. I am also thankful to the staff of the Computer Engineering Department for their invaluable suggestions and advice. I thank the college for providing the required magazines, books and access to the Internet for collecting information related to the project. I am thankful to Dr. P. K. Sinha, Senior Director HPC, C-DAC, Pune for granting me permission to study C-DAC's PARAM Yuva facility. I am also thankful to Dr. Sandeep Joshi, Mr. Rishi Pathak and Mr. Vaibhav Pol of the PARAM Yuva Supercomputing facility, C-DAC, Pune for their continuous encouragement and support throughout the course of this project. Last, but not the least, I am also grateful to my friends for their valuable comments and suggestions.

Contents

Abstract iii

Acknowledgments iv

List of Figures vi

1 Introduction 1
1.1 High Performance Computing 1
1.1.1 Types of HPC architectures 2
1.1.2 Clustering 3
1.2 Characteristics and features of clusters 4
1.3 Motivation 5
1.3.1 Problem Definition 5
1.3.2 Scope 5
1.3.3 Objectives 5

2 Literature Survey 6
2.1 HPC Opportunities in the Indian Market 6
2.2 HPC at Indian Educational Institutes 6
2.3 C-DAC 7
2.3.1 C-DAC and HPC 7
2.4 PARAM Yuva 8
2.5 Grid Computing 10
2.5.1 GARUDA: The National Grid Computing Initiative of India 10
2.5.2 GARUDA: Objectives 11
2.6 Flynn's Taxonomy 11
2.7 Single Program, Multiple Data (SPMD) 13
2.8 Message Passing and Parallel Programming Protocols 14
2.8.1 Message Passing Models 14
2.9 Speedup and Efficiency 18
2.9.1 Speedup 18
2.9.2 Efficiency 18
2.9.3 Factors affecting performance 19
2.9.4 Amdahl's Law 21
2.10 Maths Libraries 22
2.11 HPL Benchmark 24
2.11.1 Description of the HPL.dat File 25
2.11.2 Guidelines for HPL.dat configuration 30
2.12 HPCC Challenge Benchmark 32

3 Design and Implementation 35
3.1 Beowulf Clusters: A Low cost alternative 35
3.2 Logical View of proposed Cluster 36
3.3 Hardware Configuration 36
3.3.1 Master Node 36
3.3.2 Compute Nodes 37
3.3.3 Network 37
3.4 Softwares 38
3.4.1 MPICH2 39
3.4.2 HYDRA: Process Manager 44
3.4.3 TORQUE: Resource Manager 44
3.4.4 MAUI: Cluster Scheduler 45
3.5 System Considerations 46

4 Experiments 48
4.1 Finding Prime Numbers 48
4.2 PI Calculation 49
4.3 Circuit Satisfiability Problem 50
4.4 1D Time Dependent Heat Equation 51
4.4.1 The finite difference discretization 51
4.4.2 Using MPI to compute the solution 53
4.5 Fast Fourier Transform 53
4.5.1 Radix-2 FFT algorithm 54
4.6 Theoretical Peak Performance 55
4.7 Benchmarking 56
4.8 HPL 56
4.8.1 HPL Tuning 56
4.8.2 Run HPL on cluster 58

4.8.3 HPL results 59
4.9 Run HPCC on cluster 60
4.9.1 HPCC Results 61

5 Results and Applications 63
5.1 Discussion on Results 63
5.1.1 Observations about Small Tasks 63
5.1.2 Observations about Larger Tasks 63
5.2 Factors affecting Cluster performance 64
5.3 Benefits 64
5.4 Challenges of parallel computing 65
5.5 Common applications of high-performance computing clusters 67

6 Conclusion and Future Work 69
6.1 Conclusion 69
6.2 Future Work 69

Bibliography 71

Appendix A PuTTy 74
A.1 How to use PuTTY to connect to a remote computer 74
A.2 PSCP 75
A.2.1 Starting PSCP 76
A.2.2 PSCP Usage 76

List of Figures

1.1 Basic Cluster 3

2.1 Evolution of PARAM Supercomputers & HPC Roadmap 8
2.2 Block Diagram of PARAM Yuva 9
2.3 Single Instruction, Single Data stream (SISD) 12
2.4 Single Instruction, Multiple Data streams (SIMD) 12
2.5 Multiple Instruction, Single Data stream (MISD) 13
2.6 Multiple Instruction, Multiple Data streams (MIMD) 13
2.7 General MPI Program Structure 17
2.8 Speedup of a program using multiple processors 21

3.1 The Schematic structure of proposed cluster 35
3.2 Logical view of proposed cluster 36
3.3 The Network interconnection 38

4.1 Graph showing performance for Finding Primes 49
4.2 Graph showing performance for Calculating π 50
4.3 Graph showing performance for solving C-SAT Problem 51
4.4 Graph showing performance for solving 1D Time Dependent Heat Equation 52
4.5 Symbolic relation between four nodes 52
4.6 Graph showing performance of Radix-2 FFT algorithm 54
4.7 8-point Radix-2 FFT: Decimation in frequency form 55
4.8 Graph showing High Performance Linpack (HPL) Results 60

5.1 Application Perspective of Grand Challenges 67

A.1 Putty GUI 75
A.2 Putty Security Alert 75
A.3 Putty Remote Login Screen 76

Chapter 1

Introduction

HPC is a collection or cluster of connected, independent computers that work in unison to solve a problem. In general, the machines are tightly coupled at one site, connected by InfiniBand or some other high-speed interconnect technology. With HPC, the primary goal is to crunch numbers, not to sort data. It demands specialized program optimizations to get the most from a system in terms of input/output, computation, and data movement. And the machines all have to trust each other because they are shipping information back and forth. The development of new materials and production processes based on high technologies requires the solution of increasingly complex computational problems. However, even as computer power, data storage, and communication speed continue to improve exponentially, available computational resources are often failing to keep up with what users demand of them. Therefore high-performance computing (HPC) infrastructure becomes a critical resource for research and development as well as for many business applications. Traditionally, HPC applications were oriented towards the use of high-end computer systems, so-called "supercomputers".

1.1 High Performance Computing

High Performance Computing (HPC) allows scientists and engineers to deal with very complex problems using fast computer hardware and specialized software. Since these problems often require hundreds or even thousands of processor hours to complete, an approach based on the use of supercomputers has traditionally been adopted. The recent tremendous increase in the speed of PC-type computers opens a relatively cheap and scalable solution for HPC using cluster technologies. Linux clustering is popular in many industries these days. With the advent of clustering technology and the growing acceptance of open source software, supercomputers can now be created for a fraction of the cost of traditional high-performance machines. Cluster operating systems divide the tasks amongst the available systems. Clusters of systems or workstations connect a group of systems together to jointly share a critically demanding computational task. Theoretically, a cluster operating system should provide seamless optimization in every case. At present, cluster server and workstation systems are mostly used in high availability applications and in scientific applications such as numerical computations.

1.1.1 Types of HPC architectures

Most HPC systems use the concept of parallelism. Many software platforms are oriented toward HPC, but first let's look at the hardware aspects. HPC hardware falls into three categories:

• Symmetric multiprocessors (SMP)

• Vector processors

• Clusters

Symmetric multiprocessors (SMP)

SMP is a type of HPC architecture in which multiple processors share the same memory. (Clusters, also known as massively parallel processors (MPPs), do not share the same memory.) SMPs are generally more expensive and less scalable than MPPs.

Vector processors

In vector processors, the CPU is optimized to perform well with arrays or vectors; hence the name. Vector processor systems deliver high performance and were the dominant HPC architecture in the 1980s and early 1990s, but clusters have become far more popular in recent years.

Clusters

Clusters are the predominant type of HPC hardware these days; a cluster is a set of MPPs. A processor in a cluster is commonly referred to as a node and has its own CPU, memory, operating system, and I/O subsystem, and is capable of communicating with other nodes. These days it is common to use a commodity workstation running Linux and other open source software as a node in a cluster. Clustering is the use of multiple computers, typically PCs or UNIX workstations, multiple storage devices, and redundant interconnections, to form what appears to users as a single, highly available system. Cluster computing can be used for load balancing and high availability as well as for high performance computing. It is used as a relatively low-cost form of parallel processing machine for scientific and other applications that lend themselves to parallel operations. Figure 1.1 illustrates a basic cluster.

Figure 1.1: Basic Cluster

Computer cluster technology puts clusters of systems together to provide better system reliability and performance. Cluster server systems connect a group of systems together in order to jointly provide processing service for the clients in the network.

1.1.2 Clustering

The term ”cluster” can take different meanings in different contexts. This section focuses on three types of clusters:

• Fail-over clusters

• Load-balancing clusters

• High-performance clusters


Fail-over clusters

The simplest fail-over cluster has two nodes: one stays active and the other stays on stand-by but constantly monitors the active one. In case the active node goes down, the stand-by node takes over, allowing a mission-critical system to continue functioning.

Load-balancing clusters

Load-balancing clusters are commonly used for busy Web sites where several nodes host the same site, and each new request for a Web page is dynamically routed to a node with a lower load.

High-performance clusters

These clusters are used to run parallel programs for time-intensive computations and are of special interest to the scientific community. They commonly run simu- lations and other CPU-intensive programs that would take an inordinate amount of time to run on regular hardware.

1.2 Characteristics and features of clusters

1. Very high performance-price ratio.

2. Recycling possibilities of the hardware components.

3. Guarantee of usability/upgradeability in the future.

4. Clusters are built using commodity hardware and cost a fraction of the price of vector processors. In many cases, the price is lower by more than an order of magnitude.

5. Clusters use a message-passing paradigm for communication, and programs have to be explicitly coded to make use of distributed hardware.

6. Open source software components and Linux lead to lower software costs.

7. Clusters have a much lower maintenance cost (they take up less space, take less power, and need less cooling).


1.3 Motivation

1.3.1 Problem Definition

A cluster is a group of linked computers, working together closely, thus in many respects forming a single computer. High-performance computing (HPC) uses supercomputers and computer clusters to solve advanced computation problems. The benefits of HPCC (High-Performance Computing Clusters) are availability, scalability and, to a lesser extent, investment protection and simple administration. A portable and extensible parallel computing system has been built, with capability approaching that of commercial high performance supercomputers, using general-purpose PCs, networking facilities, and open software such as Linux, MPI, etc. To check the cluster's performance, the popular HPL (High Performance Linpack) benchmark and the HPCC benchmark suite are used.

1.3.2 Scope

Computing clusters provide a reasonably inexpensive method to aggregate computing power and dramatically cut the time needed to find answers in research that requires the analysis of vast amounts of data. This HPCC can be used for research on object-oriented parallel languages, recursive matrix algorithms, network protocol optimization, graphical rendering etc. It can also be used to create the college's own cloud and deploy cloud applications on it, which can be accessed from anywhere in the outside world with just a web browser.

1.3.3 Objectives

The project's primary goal is to support an instantaneous, easily available distributed computing environment. A consequence of providing such an environment is the ability to promote the education of high performance computing issues at the undergraduate level, by turning ordinary off-the-shelf networked computers into a non-invasive, fully functional cluster. Using this cluster, students and teachers will be able to gain insight into configuration, utilization, troubleshooting, debugging, and administration issues uniquely associated with parallel computing in a live, easy-to-use clustering environment. The availability of such a system will encourage more and more students and faculty to use it for their project and research work.

Chapter 2

Literature Survey

2.1 HPC Opportunities in the Indian Market

While sectors such as education, R&D, biotechnology, and weather forecasting have taken a good lead, industries such as oil & gas are likely to catch up soon. But challenges remain, largely on the application side; for instance, there is a need for more homegrown applications. Today, the bulk of the code is serial, running multiple instances of the same program. There is a genuine need to focus on code parallelization to leverage the true power of HPC. Also, the trend in HPC is toward packing more and more power into less and less footprint, and at the lowest possible price. Getting people from diverse domains to share and collaborate on one platform is the other challenge facing HPC deployment.

2.2 HPC at Indian Educational Institutes

India has the potential to be a global technology leader. Indian industry is competing globally in various sectors of science and engineering. A critical issue for the future success of the state and Indian industry is the growth of engineering and research education in India. High performance computing power is key to scientific and engineering leadership, industrial competitiveness, and national security. Right now the hardware and expertise needed for such systems is available only with a few top-notch institutions like IISc, the IITs and a few other renowned institutes. But if we want to harness the true power of HPC, we have to make sure that such systems are available to each and every engineering college.


2.3 C-DAC

C-DAC was set up in 1988 with the explicit purpose of demonstrating India's HPC capability after the US government denied the import of technology for weather forecasting purposes. Since then, C-DAC's developments have mirrored the progress of HPC computing worldwide. During its second mission, C-DAC introduced the Open Frame Architecture for cluster computing, culminating in the PARAM 10000 in 1998 and the 1 TF PARAM Padma in 2002. Along with 60 installations worldwide, C-DAC now has two HPC facilities of its own: the 100 GF (GigaFlop) PARAM 10000 at the National Param Supercomputing Facility (NPSF) in Pune and the 1 TF (TeraFlop) PARAM Padma at C-DAC's Terascale Supercomputing Facility (CTSF) in Bangalore. The indigenously built PARAM Padma debuted on the Top500 list of supercomputers at rank 171 in May 2003. After the completion of PARAM Padma (1 TF peak computing power, subsequently upgraded by another 1 TF peak) in December 2002 and its dedication to the nation in June 2003, it was used extensively as a third-party facility (CTSF) by a wide spectrum of users from academia, research labs and end-user agencies. In addition, C-DAC has been actively working since then to build its next generation HPC system (Param NG) and associated technology components. C-DAC commissioned the system called PARAM Yuva in November 2008. This system, with an Rmax (sustained performance) of 37.80 TF and an Rpeak (peak performance) of 54.01 TF, was ranked 109th in the TOP500 list released in June 2009. The system is an intermediate milestone of C-DAC's HPC roadmap towards petaflop computing by 2012. C-DAC has made significant contributions to the Indian HPC arena in terms of awareness (by means of training programmes), consultancy, skilled manpower and technology development, as well as through deployment of systems and solutions for use by the scientific, engineering and business community.

2.3.1 C-DAC and HPC

C-DAC has taken the initiative in conducting national awareness programs in high performance computing for the scientific and engineering community and encourages the establishment of High Performance Computing Labs in all universities and colleges. This will help in capacity building, and such labs will act as computational research centres for scientific and academic programs, addressing and catalysing the impact of high quality engineering education and high-end computational work for the research community in the eastern region. They will also promote research and teaching by integrating leading-edge high performance computing and visualization for the faculty, students, graduates and postgraduates of the institute, and will provide solutions to many of our most pressing national challenges.

Figure 2.1: Evolution of PARAM Supercomputers & HPC Roadmap

2.4 PARAM Yuva

The latest in the series is called PARAM Yuva, which was developed in 2008 and was ranked 68th in the TOP500 list released in November 2008 at the Supercomputing Conference in Austin, Texas, United States. The system, according to C-DAC scientists, is an intermediate milestone of C-DAC's HPC road map towards achieving petaflops (million billion flops) computing speed by 2012. As part of this, C-DAC has also set up a National PARAM Supercomputing Facility (NPSF) in Pune, where C-DAC is headquartered, to allow researchers access to HPC systems to address their compute-intensive problems. C-DAC's efforts in this strategically and economically important area have thus put India on the supercomputing map of the world along with select developed nations. As of 2008, 52 PARAM systems have been deployed in the country and abroad, eight of them at locations in Russia, Singapore, Germany and Canada. The PARAM series of cluster computing systems is based on what is called OpenFrame Architecture. PARAM Yuva, in particular, uses a high-speed 10 gigabits per second (Gbps) system area network called PARAM Net-3, developed indigenously by C-DAC over the last three years, as the primary interconnect. This HPC cluster system is built with nodes designed around a state-of-the-art x86 architecture based on quad-core processors. In all, PARAM Yuva, in its complete configuration, has 4,608 cores of Intel Xeon 73XX processors, called Tigerton, with a clock speed of 2.93 gigahertz (GHz). The system has a sustained performance of 37.8 Tflops and a peak speed of 54 Tflops.

Figure 2.2: Block Diagram of PARAM Yuva

A novel feature of PARAM Yuva is its reconfigurable computing (RC) capability, which is an innovative way of speeding up HPC applications by dynamically configuring hardware to a suite of algorithms or applications, run on PARAM Yuva for the first time. The RC hardware essentially uses acceleration cards as external add-ons to boost speed significantly while saving on power and space. C-DAC is one of the first organisations to bring the concept of reconfigurable hardware resources to the country. C-DAC has not only implemented the latest RC hardware, it has also developed system software and hardware libraries to achieve appropriate accelerations in performance. As C-DAC has been scaling different milestones in HPC hardware, it has also been developing HPC application software, providing end-to-end solutions in an HPC environment to different end-users in mission mode. In early January, C-DAC set up a supercomputing facility around a scaled-down version of PARAM Yuva at North-Eastern Hill University (NEHU) in Shillong, complete with all allied C-DAC technology components and application software.


2.5 Grid Computing

Grid computing is a term referring to the federation of computer resources from multiple administrative domains to reach a common goal. The grid can be thought of as a distributed system with non-interactive workloads that involve a large number of files. What distinguishes grid computing from conventional high performance computing systems such as cluster computing is that grids tend to be more loosely coupled, heterogeneous, and geographically dispersed. Although a grid can be dedicated to a specialized application, it is more common that a single grid will be used for a variety of different purposes. Grids are often constructed with the aid of general-purpose grid software libraries known as middleware. Grid size can vary by a considerable amount. Grids are a form of distributed computing whereby a super virtual computer is composed of many networked, loosely coupled computers acting together to perform very large tasks. For certain applications, distributed or grid computing can be seen as a special type of parallel computing that relies on complete computers (with onboard CPUs, storage, power supplies, network interfaces, etc.) connected to a network (private, public or the Internet) by a conventional network interface, such as Ethernet. This is in contrast to the traditional notion of a supercomputer, which has many processors connected by a local high-speed computer bus.

2.5.1 GARUDA: The National Grid Computing Initiative of India

GARUDA is a collaboration of science researchers and experimenters on a nationwide grid of computational nodes, mass storage and scientific instruments that aims to provide the technological advances required to enable data and compute intensive science for the 21st century. One of GARUDA's most important challenges is to strike the right balance between research and the daunting task of deploying innovation into some of the most complex scientific and engineering endeavors being undertaken today. Building a commanding position in grid computing is crucial for India. By allowing researchers to easily access supercomputer-level processing power and knowledge resources, grids will underpin progress in Indian science, engineering and business. The challenge facing India today is to turn technologies developed for researchers into industrial-strength business tools. The Department of Information Technology (DIT), Government of India has funded the Centre for Development of Advanced Computing (C-DAC) to deploy the nationwide computational grid GARUDA, which will connect 17 cities across the country in its Proof of Concept (PoC) phase, with an aim to bring grid-networked computing to research labs and industry. GARUDA will accelerate India's drive to turn its substantial research investment into tangible economic benefits.

2.5.2 GARUDA: Objectives

GARUDA aims at strengthening and advancing scientific and technological excellence in the area of Grid and Peer-to-Peer technologies. The strategic objectives of GARUDA are to:

• Create a test bed for the research and engineering of technologies, architectures, standards and applications in Grid Computing

• Bring together all potential research, development and user groups who can help develop a national initiative on Grid computing

• Create the foundation for the next generation grids by addressing long term research issues in the strategic areas of knowledge and data management, programming models, architectures, grid management and monitoring, problem solving environments, grid tools and services

The following key deliverables have been identified as important to achieving the GARUDA objectives:

• Grid tools and services to provide an integrated infrastructure to applications and higher-level layers

• A Pan-Indian communication fabric to provide seamless and high-speed access to resources

• Aggregation of resources including compute clusters, storage and scientific instruments

• Creation of a consortium to collaborate on grid computing and contribute towards the aggregation of resources

• Grid enablement and deployment of select applications of national importance requiring aggregation of distributed resources

To achieve the above objectives, GARUDA brings together a critical mass of well-established researchers from 45 research laboratories and academic institutions that have formulated an ambitious program of activities.

2.6 Flynn’s Taxonomy

The four classifications defined by Flynn are based upon the number of concurrent instruction (or control) and data streams available in the architecture:

Single Instruction, Single Data stream (SISD) A sequential computer which exploits no parallelism in either the instruction or data streams. A single control unit (CU) fetches a single instruction stream (IS) from memory. The CU then generates appropriate control signals to direct a single processing element (PE) to operate on a single data stream (DS), i.e. one operation at a time.

Figure 2.3: Single Instruction, Single Data stream (SISD)

Examples of SISD architecture are the traditional uniprocessor machines like a PC (currently manufactured PCs have multiple processors) or old mainframes.

Single Instruction, Multiple Data streams (SIMD)

Figure 2.4: Single Instruction, Multiple Data streams (SIMD)

A computer which exploits multiple data streams against a single instruction stream to perform operations which may be naturally parallelized. For example, an array processor or GPU.

Multiple Instruction, Single Data stream (MISD) Multiple instructions operate on a single data stream. Uncommon architecture which is generally used for fault tolerance. Heterogeneous systems operate on the same data stream and must agree on the result.


Figure 2.5: Multiple Instruction, Single Data stream (MISD)

Examples include the Space Shuttle flight control computer.

Multiple Instruction, Multiple Data streams (MIMD) Multiple autonomous processors simultaneously executing different instructions on different data. Distributed systems are generally recognized to be MIMD architectures, either exploiting a single shared memory space or a distributed memory space. A multi-core superscalar processor is an MIMD processor.

Figure 2.6: Multiple Instruction, Multiple Data streams (MIMD)

2.7 Single Program, Multiple Data (SPMD)

The proposed cluster mostly uses a variation of the MIMD category, namely SPMD: multiple autonomous processors simultaneously execute the same program (but at independent points, rather than in the lockstep that SIMD imposes) on different data. SPMD is also sometimes expanded as 'Single Process, Multiple Data', but this usage is erroneous and should be avoided: SPMD is a parallel execution model and assumes multiple cooperating processes executing a program. SPMD is the most common style of parallel programming. The SPMD model and the term were proposed by Frederica Darema.
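To make the SPMD idea concrete, the following short C sketch (illustrative only, and not part of the cluster software described in this thesis) shows every process running the same program while working on its own slice of the data, selected by its MPI rank; the problem size and the final reduction are chosen purely for illustration.

/* spmd_sketch.c - hypothetical SPMD example: every rank runs this same
 * program but sums a different slice of the index range.            */
#include <stdio.h>
#include <mpi.h>

#define N 1000000

int main(int argc, char **argv)
{
    int rank, size;
    long i, lo, hi;
    double local = 0.0, total = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    lo = (long)rank * N / size;          /* this rank's slice          */
    hi = (long)(rank + 1) * N / size;
    for (i = lo; i < hi; i++)
        local += 1.0 / (double)(i + 1);  /* same code, different data  */

    /* combine the partial sums on rank 0 */
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("harmonic sum over %d elements = %f\n", N, total);

    MPI_Finalize();
    return 0;
}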

2.8 Message Passing and Parallel Programming Protocols

Message passing is a form of communication used in parallel computing, object-oriented programming, and interprocess communication. In this model, processes or objects can send and receive messages (comprising zero or more bytes, complex data structures, or even segments of code) to other processes. By waiting for messages, processes can also synchronize. Three protocols are presented here for parallel programming: one which has become the standard, one which used to be the standard, and one which some feel might be the next big thing. For a while, the parallel protocol war was being waged between PVM and MPI. By most accounts, MPI won. It is a highly efficient and easy-to-learn protocol that has been implemented on a wide variety of platforms. One criticism is that different implementations of MPI don't always talk to one another. However, most cluster install packages provide both of the two most common implementations (MPICH and LAM/MPI). When setting up a small cluster, one can choose freely between them; they both work well, and as long as the same version of MPI is installed on each machine, there is no need to rewrite any MPI code. MPI stands for Message Passing Interface. Basically, independent processes send messages to each other. Both LAM/MPI and MPICH simplify the process of starting large jobs on multiple machines. MPI is the most common and efficient parallel protocol in current use.

2.8.1 Message Passing Models

Message passing models for parallel computation have been widely adopted because of their similarity to the physical attributes of many multiprocessor architectures. Probably the most widely adopted message passing model is MPI. MPI, or Message Passing Interface, was released in 1994 after two years in the design phase. MPI's functionality is fairly straightforward. For several years, MPI has been the de facto standard for writing parallel applications. One of the most popular MPI implementations is MPICH. Its successor, MPICH2, features a completely new design that provides more performance and flexibility. To ensure portability, it has a hierarchical structure based on which porting can be done at different levels.


MPICH2 programs are written in C or FORTRAN and linked against the MPI libraries; C++ and Fortran90 bindings are also supported. MPI applications run in a multiple-instruction multiple-data (MIMD) manner.

MPI

MPI provides a straightforward interface to write software that can use multiple cores of a computer, and multiple computers in a cluster or nodes in a supercomputer. Using MPI, one can write code that uses all of the cores and all of the nodes in a multicore computer cluster, and that will run faster as more cores and more compute nodes become available. MPI is a well-established, standard method of writing parallel programs. It was first released in 1994 and is currently at version 2.1. MPI is implemented as a library, which is available for nearly all computer platforms (e.g. Linux, Windows, OS X), and with interfaces for many popular languages (e.g. C, C++, Fortran, Python). MPI stands for "Message Passing Interface", and it parallelizes computational work by providing tools that allow a team of processes to solve the problem, and for the team to then share the solution by passing messages amongst one another. MPI can be used to parallelize programs that run locally, by having all processes in the team run locally, or it can be used to parallelize programs across a compute cluster, by running one or more processes per node. MPI can be combined with other parallel programming technologies, e.g. OpenMP.

Basic MPI Calls

It is often said that there are two views of MPI. One view is that MPI is a lightweight protocol with only 6 commands. The other view is that it is an in-depth protocol with hundreds of specialized commands. The 6 basic MPI commands are:

• MPI_Init

• MPI_Comm_size

• MPI_Comm_rank

• MPI_Send

• MPI_Recv

• MPI_Finalize

In short, set up an MPI program, get the number of processes participating in the program, determine which of those processes corresponds to the one calling the command, send messages, receive messages, and stop participating in a parallel program.

1. MPI_Init(int *argc, char ***argv) Takes the command line arguments to a program, checks for any MPI options, and passes remaining command line arguments to the main program.

2. MPI_Comm_size(MPI_Comm comm, int *size) Determines the size of a given MPI communicator. A communicator is a set of processes that work together. For typical programs this is the default MPI_COMM_WORLD, which is the communicator for all processes available to an MPI program.

3. MPI_Comm_rank(MPI_Comm comm, int *rank) Determines the rank of the current process within a communicator. Typically, if an MPI program is being run on N processes, the communicator would be MPI_COMM_WORLD, and the rank would be an integer from 0 to N-1.

4. MPI_Send(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm) Sends the contents of buf, which contains count elements of type datatype, to the process of rank dest in the communicator comm, flagged with the message tag. Typically, the communicator is MPI_COMM_WORLD.

5. MPI_Recv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status) Reads into buf count values of type datatype from process source in communicator comm if a message is sent flagged with tag. Also receives information about the transfer into status.

6. MPI_Finalize() Handles anything that the current MPI protocol will need to do before exiting a program. Typically this should be the final or near-final line of a program.
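The following minimal C program is an illustrative sketch (not taken from the thesis experiments) that uses exactly these six calls: process 1 sends a short character message to process 0, which receives and prints it.

/* six_calls.c - minimal sketch using only the six basic MPI calls.
 * Assuming an MPICH2-style installation:
 *   compile with:  mpicc six_calls.c -o six_calls
 *   run with:      mpiexec -n 2 ./six_calls                        */
#include <stdio.h>
#include <string.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;
    char msg[64];
    MPI_Status status;

    MPI_Init(&argc, &argv);                   /* start MPI              */
    MPI_Comm_size(MPI_COMM_WORLD, &size);     /* how many processes     */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);     /* which one am I         */

    if (rank == 1) {
        strcpy(msg, "hello from rank 1");
        MPI_Send(msg, (int)strlen(msg) + 1, MPI_CHAR, 0, 99, MPI_COMM_WORLD);
    } else if (rank == 0 && size > 1) {
        MPI_Recv(msg, 64, MPI_CHAR, 1, 99, MPI_COMM_WORLD, &status);
        printf("rank 0 received: %s\n", msg);
    }

    MPI_Finalize();                           /* leave the parallel run */
    return 0;
}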

MPICH2: Message Passing Interface

The MPICH implementation of MPI is one of the most popular versions of MPI. Recently, MPICH was completely rewritten; the new version is called MPICH2 and includes all of MPI, both MPI-1 and MPI-2. This section describes how to obtain, build, and install MPICH2 on a Beowulf cluster. It then describes how to set up an MPICH2 environment in which MPI programs can be compiled, executed, and debugged. MPICH2 is recommended for all Beowulf clusters by many researchers. The original MPICH is still available but is no longer being developed.

Figure 2.7: General MPI Program Structure

PVM

PVM (Parallel Virtual Machine) is a freely available, portable, message-passing library generally implemented on top of sockets. PVM's daemon-based implementation makes it easy to start large jobs on multiple machines. PVM was the first standard for parallel computing to become widely accepted. As a result, there is a large amount of legacy code in PVM still available. PVM also allows a program to spawn multiple programs from within the original program, and it can easily spawn other processes recursively. It is a simple implementation that works across different platforms. Nowadays it is mainly used by people who have legacy PVM code that they do not want to modify.

JavaSpaces

Java is a versatile computer language that is object oriented and is widely used in computer science schools around the country. JavaSpaces is Java's parallel programming framework, which operates by writing entries into a shared space. Programs can access the space and either add an entry, read an entry without removing it, or take an entry.


Java is an interpreted language, and as such typical programs will not run at the same speed as compiled languages such as C/C++ and Fortran. However, much progress has been made in the area of Java efficiency, and many operating systems have what are known as just-in-time compilers. Current claims are that a well optimized Java platform can run Java code at about 90% of the speed of similar C/C++ code. Java has a versatile security policy that is extremely flexible, but it can also be difficult to learn. JavaSpaces suffers from high latency and a lack of network optimization, but for embarrassingly parallel problems that do not require synchronization, the JavaSpaces model of putting jobs into a space, letting any "worker" take jobs out of the space, and having the workers put results into the space when done leads to very natural approaches to load balancing, and may be well suited to non-coupled, highly distributed computations such as SETI@Home. JavaSpaces does not have any simple mechanism for starting large jobs on multiple machines. JavaSpaces is a good choice if one needs to pass not just data but also instructions on what to do with that data. It also provides an object-oriented parallel framework.

2.9 Speedup and Efficiency

2.9.1 Speedup

The speedup of a parallel code is how much faster it runs in parallel. If the time it takes to run a code on one processor is T_1 and the time it takes to run the same code on N processors is T_N, then the speedup is given by

S = T_1 / T_N

This can depend on many things, but primarily depends on the ratio of the amount of time the code spends communicating to the amount of time it spends computing.

2.9.2 Efficiency

Efficiency is a measure of how much of the available processing power is being used. The simplest way to think of it is as the speedup per processor. This is equivalent to defining efficiency as the ratio of the time to run N models on N processors to the time to run one model on one processor.

E = S / N = T_1 / (N × T_N)

This gives a more accurate measure of the true efficiency of a parallel program than CPU usage, as it takes into account redundant calculations as well as idle time.
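As a brief worked example (the timings here are hypothetical and purely illustrative): if a code takes T_1 = 100 seconds on one processor and T_8 = 20 seconds on eight processors, then the speedup is S = 100/20 = 5 and the efficiency is E = 5/8 = 0.625, i.e. only about 62% of the available processing power is doing useful work, the rest being lost to communication, idle time and redundant computation.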

2.9.3 Factors affecting performance

The factors which can affect an MPI application’s performance are numerous, complex and interrelated. Because of this, generalizing about an application’s performance is usually very difficult. Most of the important factors are briefly described below.

Platform / Architecture Related

1. cpu - clock speed, number of cpus

2. Memory subsystem - memory and cache configuration, memory-cache-cpu bandwidth, memory copy bandwidth

3. Network adapters - type, latency and bandwidth characteristics

4. Operating system characteristics - many

Network Related

1. Protocols - TCP/IP, UDP/IP, other

2. Configuration, routing, etc

3. Network tuning options (”no” command)

4. Network contention / saturation

Application Related

1. Algorithm efficiency and scalability

2. Communication to computation ratios

3. Load balance


4. Memory usage patterns

5. I/O

6. Message size used

7. Types of MPI routines used - blocking, non-blocking, point-to-point, collec- tive communications

MPI Implementation Related

1. Message buffering

2. Message passing protocols - eager, rendezvous, other

3. Sender-Receiver synchronization - polling, interrupt

4. Routine internals - efficiency of algorithm used to implement a given routine

Network Contention

1. Network contention occurs when the volume of data being communicated between MPI tasks saturates the bandwidth of the network.

2. Saturation of the network bandwidth results in an overall decrease of com- munications performance for all tasks.

Because of these challenges and complexities, performance analysis tools are essential to optimizing an application's performance. They can assist in understanding what a program is "really doing" and suggest how program performance should be improved. The primary issue with speedup is the communication to computation ratio. To get a higher speedup,

• Communicate less

• Compute more

• Make connections faster

• Communicate faster


The amount of time the computer requires to make a connection to another computer is referred to as its latency, and the rate at which data can be transferred is the bandwidth. Both can have an impact on the speedup of a parallel code. Collective communication can also help speed up the code. As an example, imagine you are trying to tell a number of people about a party. One method would be to tell each person individually; another would be to tell people to "spread the word". Collective communication refers to improving communication speed by having any node that already has the information participate in sending it to other nodes. Not all protocols allow for collective communication, and even protocols which do may not require a vendor to implement it. An example is the broadcast routine in MPI. Many vendor-specific versions of MPI provide broadcast routines which use a "tree" method of communication. The more common implementations found on most clusters (Open MPI, LAM/MPI and MPICH) simply have the sending machine contact each receiving machine in turn.
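For reference, a collective broadcast is expressed in MPI as shown in the sketch below (illustrative code, not from the thesis; the broadcast value is hypothetical). Whether the library realizes MPI_Bcast as a simple linear send or as a tree-shaped broadcast is left to the particular MPI implementation.

/* bcast_sketch.c - the root process distributes one parameter to every rank. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank;
    double parameter = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)
        parameter = 3.14;   /* hypothetical value known only to the root */

    /* every rank calls MPI_Bcast; afterwards all ranks hold the value */
    MPI_Bcast(&parameter, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    printf("rank %d has parameter = %f\n", rank, parameter);

    MPI_Finalize();
    return 0;
}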

2.9.4 Amdahl’s Law

Amdahl’s law, also known as Amdahl’s argument, is named after computer ar- chitect Gene Amdahl, and is used to find the maximum expected improvement to an overall system when only part of the system is improved. It is often used in parallel computing to predict the theoretical maximum speedup using multiple processors.

Figure 2.8: Speedup of a program using multiple processors


OverallSpeedup = 1 / ((1 − f) + f/s)

where f is the fraction of the code that is parallel and s is the speedup of the enhanced (parallel) portion. The speedup of a program using multiple processors in parallel computing is limited by the time needed for the sequential fraction of the program. For example, if a program needs 20 hours using a single processor core, and a particular portion of 1 hour cannot be parallelized, while the remaining portion of 19 hours (95%) can be parallelized, then regardless of how many processors are devoted to a parallelized execution of this program, the minimum execution time cannot be less than that critical 1 hour. Hence the speedup is limited to at most 20x, as the diagram illustrates.
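The following small C sketch (illustrative only, not part of the thesis code) evaluates Amdahl's law for the 95%-parallel example above and shows the speedup levelling off towards the 20x ceiling as processors are added.

/* amdahl.c - evaluate Amdahl's law S(n) = 1 / ((1 - f) + f / n)
 * for a code whose parallel fraction is f and which runs on n processors. */
#include <stdio.h>

static double amdahl_speedup(double f, double n)
{
    return 1.0 / ((1.0 - f) + f / n);
}

int main(void)
{
    double f = 0.95;   /* 95% of the run time is parallelizable */
    int n;

    for (n = 1; n <= 4096; n *= 4)
        printf("%5d processors -> speedup %.2f\n", n, amdahl_speedup(f, (double)n));

    /* as n grows the speedup approaches 1 / (1 - f) = 20 */
    return 0;
}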

2.10 Maths Libraries

For computer programmers, calling pre-written subroutines to do complex calculations dates back to early computing history. With minimal effort, any developer can write a function that multiplies two matrices, but these same developers would not want to re-write that function for every new program that requires it. Further, with good theory and practice, one can optimize practically any algorithm to run several times faster, though it would typically take several hours to days to match the performance of a highly optimized algorithm. Scientific computing, and the use of math libraries, was traditionally limited to research labs and engineering disciplines. In recent decades, this niche computing market has blossomed across a variety of industries. While research institutes and universities are still the largest users of math libraries, especially in the High Performance Computing (HPC) arena, industries like financial services and biotechnology are increasingly turning to math libraries as well. Even the business analytics arena around business intelligence and data mining is starting to leverage the existing tools. From bond pricing and portfolio optimization to exotic instrument evaluations and exchange rate analysis, the financial services industry has a wide variety of requirements for complex mathematical algorithms. Similarly, the biology disciplines have aligned with statisticians to analyze experimental procedures which produce hundreds of thousands of results. The core area of the math library market implements linear algebra algorithms.


More specialized functions, such as numerical optimization and time series forecasting, are often invoked explicitly by users. In contrast, linear algebra functions are often used as key background components for solving a wide variety of problems. Eigen analysis, matrix inversion and other linear calculations are essential components in nearly every statistical analysis in use today, including regression, factor analysis, discriminant analysis, etc. The most basic suite of such algorithms is the BLAS (Basic Linear Algebra Subprograms) libraries for basic vector and matrix operations.

BLAS

BLAS is the Basic Linear Algebra Subprograms. It is a set of routines used to perform common low-level matrix manipulations such as rotations or dot products. BLAS should be optimized to run on the given hardware. This can be done by getting a vendor-supplied package (i.e., provided by Sun or Intel), or else by using the ATLAS software.
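As a small illustration of calling a BLAS routine from C (a sketch only; it assumes a CBLAS interface such as the one shipped with ATLAS, and the header name and link flags vary between installations), the code below multiplies two 2x2 matrices with cblas_dgemm.

/* dgemm_sketch.c - C = alpha*A*B + beta*C using the CBLAS interface.
 * Link against a BLAS that provides CBLAS, e.g.
 *   cc dgemm_sketch.c -lcblas -latlas                                */
#include <stdio.h>
#include <cblas.h>

int main(void)
{
    double A[4] = {1.0, 2.0,    /* row-major 2x2 matrices */
                   3.0, 4.0};
    double B[4] = {5.0, 6.0,
                   7.0, 8.0};
    double C[4] = {0.0, 0.0,
                   0.0, 0.0};

    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                2, 2, 2,        /* M, N, K       */
                1.0, A, 2,      /* alpha, A, lda */
                B, 2,           /* B, ldb        */
                0.0, C, 2);     /* beta, C, ldc  */

    printf("C = [%g %g; %g %g]\n", C[0], C[1], C[2], C[3]);
    return 0;
}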

ATLAS

ATLAS is the Automatically Tuned Linear Algebra Software package. It is software that attempts to tune the BLAS implementation that it provides to the hardware. ATLAS also provides a very minimal LAPACK implementation, so it is better to install the complete LAPACK package separately.

LAPACK

LAPACK is the Linear Algebra Package. It extends BLAS to provide higher level linear algebra routines such as computing eigenvalues, or finding the solutions to a system of linear equations. LAPACK is a library of Fortran 77 subroutines for solving the most commonly occurring problems in numerical linear algebra. It has been designed to be efficient on a wide range of modern high-performance computers. The name LAPACK is an acronym for Linear Algebra PACKage. Previously LINPACK was used for benchmarking. LINPACK is a collection of Fortran subroutines that analyse and solve linear equations and linear least-squares problems. But now it is completely superseded by LAPACK.

Problems that LAPACK can Solve

LAPACK can solve systems of linear equations, linear least squares problems, eigenvalue problems and singular value problems. LAPACK can also handle many associated computations such as matrix factorizations or estimating condition numbers. LAPACK contains driver routines for solving standard types of problems, computational routines to perform a distinct computational task, and auxiliary routines to perform a certain subtask or common low-level computation. Each driver routine typically calls a sequence of computational routines. Taken as a whole, the computational routines can perform a wider range of tasks than are covered by the driver routines. Many of the auxiliary routines may be of use to numerical analysts or software developers, so the Fortran source for these routines is documented with the same level of detail used for the LAPACK driver and computational routines. Dense and band matrices are provided for, but not general sparse matrices. In all areas, similar functionality is provided for real and complex matrices.

2.11 HPL Benchmark

HPL is a software package that solves a (random) dense linear system in double precision (64-bit) arithmetic on distributed-memory computers. It can thus be regarded as a portable as well as freely available implementation of the High Performance Computing Linpack benchmark. The algorithm used by HPL can be summarized by the following keywords:

• Two-dimensional block-cyclic data distribution

• Right-looking variant of the LU factorization with row partial pivoting featuring multiple look-ahead depths

• Recursive panel factorization with pivot search and column broadcast combined

• Various virtual panel broadcast topologies

• Bandwidth-reducing swap-broadcast algorithm

• Backward substitution with look-ahead of depth 1

The HPL package provides a testing and timing program to quantify the accuracy of the obtained solution as well as the time it took to compute it. The best performance achievable by this software on a system depends on a large variety of factors. Nonetheless, with some restrictive assumptions on the interconnection network, the algorithm described here and its attached implementation are scalable in the sense that their parallel efficiency is maintained constant with respect to the per-processor memory usage. The HPL software package requires the availability of an implementation of the Message Passing Interface (MPI) on the system. An implementation of either the Basic Linear Algebra Subprograms (BLAS) or the Vector Signal Image Processing Library (VSIPL) is also needed. Machine-specific as well as generic implementations of MPI, the BLAS and VSIPL are available for a large variety of systems.


2.11.1 Description of the HPL.dat File

Line 1: (unused) Typically one would use this line for one's own purposes; for example, it could be used to summarize the content of the input file. By default this line reads:
HPL Linpack benchmark input file

Line 2: (unused) Same as line 1. By default this line reads:
Innovative Computing Laboratory, University of Tennessee

Line 3: The user can choose where the output should be redirected to. In the case of a file, a name is necessary, and this is the line where one specifies it. Only the first name on this line is significant. By default, the line reads:
HPL.out output file name (if any)
This means that if one chooses to redirect the output to a file, the file will be called "HPL.out". The rest of the line is unused, and this space can be used to put an informative comment on the meaning of the line.

Line 4: This line specifies where the output should go. The line is formatted; it must begin with a positive integer, and the rest is insignificant. Three choices are possible for the positive integer: 6 means that the output will go to the standard output, 7 means that the output will go to the standard error, and any other integer means that the output should be redirected to a file, whose name has been specified in the line above. This line by default reads:
6 device out (6=stdout,7=stderr,file)
which means that the output generated by the executable should be redirected to the standard output.

Line 5: This line specifies the number of problem sizes to be executed. This number should be less than or equal to 20. The first integer is significant, the rest is ignored. If the line reads:
3 # of problems sizes (N)
this means that the user is willing to run 3 problem sizes that will be specified in the next line.

Line 6: This line specifies the problem sizes one wants to run. Assuming the line above started with 3, the first 3 positive integers are significant, the rest is ignored. For example:
3000 6000 10000 Ns
means that one wants xhpl to run 3 (specified in line 5) problem sizes, namely 3000, 6000 and 10000.

Line 7: This line specifies the number of block sizes to be run. This number should be less than or equal to 20. The first integer is significant, the rest is ignored. If the line reads:
5 # of NBs
this means that the user is willing to use 5 block sizes that will be specified in the next line.

Line 8: This line specifies the block sizes one wants to run. Assuming the line above started with 5, the first 5 positive integers are significant, the rest is ignored. For example:
80 100 120 140 160 NBs
means that one wants xhpl to use 5 (specified in line 7) block sizes, namely 80, 100, 120, 140 and 160.

Line 9: This line specifies how the MPI processes should be mapped onto the nodes of the platform. There are currently two possible mappings, namely row- and column-major. This feature is mainly useful when the nodes are themselves multi-processor computers. A row-major mapping is recommended.
0 PMAP process mapping (0=Row-,1=Column-major)

Line 10: This line specifies the number of process grids to be run. This number should be less than or equal to 20. The first integer is significant, the rest is ignored. If the line reads:
2 # of process grids (P x Q)
this means that xhpl will try 2 process grid sizes that will be specified in the next lines.

Lines 11-12: These two lines specify the number of process rows and columns of each grid to run on. Assuming line 10 started with 2, the first 2 positive integers of those two lines are significant, the rest is ignored. For example:
1 2 Ps
6 8 Qs
means that one wants to run xhpl on 2 process grids (line 10), namely 1-by-6 and 2-by-8. Note: in this example it is then required to start xhpl on at least 16 nodes (the maximum of Pi-by-Qi). The runs on the two grids will be consecutive. If one were starting xhpl on more than 16 nodes, say 52, only 6 would be used for the first grid (1x6) and then 16 (2x8) would be used for the second grid. The fact that the MPI job was started on 52 nodes will not make HPL use all of them; in this example, only 16 would be used. If one wants to run xhpl with 52 processes, one needs to specify a grid of 52 processes; for example, the following lines would do the job:
4 2 Ps
13 8 Qs
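For instance, to satisfy the 2-by-8 grid above at least 16 MPI processes must be launched. With MPICH2's Hydra process manager a hypothetical invocation (the host file name "hosts" is illustrative) would be:

mpiexec -f hosts -n 16 ./xhpl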

Line 13: This line specifies the threshold to which the residuals should be compared. The residuals should be of order 1, but in practice are slightly less than this, typically 0.001. This line is made of a real number; the rest is not significant. For example:
16.0 threshold
In practice, a value of 16.0 will cover most cases. For various reasons, it is possible that some of the residuals become slightly larger, say for example 35.6. xhpl will flag those runs as failed; however, they can be considered as correct. A run should be considered as failed if the residual is a few orders of magnitude bigger than 1, for example 10^6 or more. Note: if one were to specify a threshold of 0.0, all tests would be flagged as failed, even though the answer is likely to be correct. It is allowed to specify a negative value for this threshold, in which case the checks will be bypassed, no matter what the threshold value is, as soon as it is negative. This feature allows one to save time when performing a lot of experiments, say for instance during the tuning phase. Example:
-16.0 threshold

The remaining lines specify algorithmic features. xhpl will run all possible combinations of those for each problem size, block size and process grid combination. This is handy when one looks for an "optimal" set of parameters. To understand this a little better, let us first say a few words about the algorithm implemented in HPL. Basically it is a right-looking version with row-partial pivoting. The panel factorization is matrix-matrix operation based and recursive, dividing the panel into NDIV subpanels at each step. This part of the panel factorization is denoted below by "recursive panel factorization (RFACT)". The recursion stops when the current panel is made of at most NBMIN columns. At that point, xhpl uses a matrix-vector operation based factorization, denoted below by "PFACTs".


Classic recursion would then use NDIV=2, NBMIN=1. There are essentially 3 numerically equivalent LU factorization algorithm variants (left-looking, Crout and right-looking). In HPL, one can choose any of them for the RFACT as well as for the PFACT. The following lines of HPL.dat allow one to set those parameters.
Lines 14-21 (Example 1):
3 # of panel fact
0 1 2 PFACTs (0=left, 1=Crout, 2=Right)
4 # of recursive stopping criterium
1 2 4 8 NBMINs (>= 1)
3 # of panels in recursion
2 3 4 NDIVs
3 # of recursive panel fact.
0 1 2 RFACTs (0=left, 1=Crout, 2=Right)
This example would try all variants of PFACT, 4 values for NBMIN (namely 1, 2, 4 and 8), 3 values for NDIV (namely 2, 3 and 4), and all variants of RFACT.
Lines 14-21 (Example 2):
2 # of panel fact
2 0 PFACTs (0=left, 1=Crout, 2=Right)
2 # of recursive stopping criterium
4 8 NBMINs (>= 1)
1 # of panels in recursion
2 NDIVs
1 # of recursive panel fact.
2 RFACTs (0=left, 1=Crout, 2=Right)
This example would try 2 variants of PFACT (namely right-looking and left-looking), 2 values for NBMIN (namely 4 and 8), 1 value for NDIV (namely 2), and one variant of RFACT.
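To get a feel for how quickly these choices multiply, the number of runs implied by Example 1 above can be counted directly (a small illustrative calculation, not part of the original HPL documentation):

$3 \text{ PFACTs} \times 4 \text{ NBMINs} \times 3 \text{ NDIVs} \times 3 \text{ RFACTs} = 108$

algorithmic variants, each of which is executed for every requested combination of problem size, block size and process grid.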

In the main loop of the algorithm, the current panel of columns is broadcast in process rows using a virtual ring topology. HPL offers various choices and one most likely wants to use the increasing ring modified topology, encoded as 1; 3 and 4 are also good choices.
Lines 22-23 (Example 1):
1 # of broadcast
1 BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
This will cause HPL to broadcast the current panel using the increasing ring modified topology.


Lines 22-23 (Example 2):
2 # of broadcast
0 4 BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
This will cause HPL to broadcast the current panel using the increasing ring virtual topology and the long message algorithm.

Lines 24-25 specify the look-ahead depth used by HPL. A depth of 0 means that the next panel is factorized only after the update by the current panel is completely finished. A depth of 1 means that the next panel is factorized immediately after being updated; the update by the current panel is then finished. A depth of k means that the k next panels are factorized immediately after being updated, and the update by the current panel is then finished. It turns out that a depth of 1 seems to give the best results, but it may need a large problem size before the performance gain is visible. So use 1 if you do not know better, otherwise you may want to try 0. Look-ahead depths of 3 and larger will probably not give better results.
Lines 24-25 (Example 1):
1 # of lookahead depth
1 DEPTHs (>= 0)
This will cause HPL to use a look-ahead of depth 1.
Lines 24-25 (Example 2):
2 # of lookahead depth
0 1 DEPTHs (>= 0)
This will cause HPL to use look-ahead depths 0 and 1.

Lines 26-27 specify the swapping algorithm used by HPL for all tests. There are currently two swapping algorithms available, one based on "binary exchange" and the other based on a "spread-roll" procedure (also called "long" below). For large problem sizes, the latter is likely to be more efficient. The user can also choose to mix both variants, that is, "binary-exchange" for a number of columns less than a threshold value and then the "spread-roll" algorithm. This threshold value is specified on line 27.
Lines 26-27 (Example 1):
1 SWAP (0=bin-exch,1=long,2=mix)
60 swapping threshold
This will cause HPL to use the "long" or "spread-roll" swapping algorithm. Note that a threshold is specified in this example but not used by HPL.


Lines 26-27 (Example 2):
2 SWAP (0=bin-exch,1=long,2=mix)
60 swapping threshold
This will cause HPL to use the "long" or "spread-roll" swapping algorithm as soon as there are more than 60 columns in the row panel. Otherwise, the "binary-exchange" algorithm will be used instead.

Line 28 specifies whether the upper triangle of the panel of columns should be stored in transposed or non-transposed form. Example: 0 L1 in (0=transposed,1=no-transposed) form

Line 29 specifies whether the panel of rows U should be stored in transposed or non-transposed form. Example: 0 U in (0=transposed,1=no-transposed) form

Line 30 enables or disables the equilibration phase. This option is not used unless 1 or 2 is selected in line 26. Example: 1 Equilibration (0=no,1=yes)

Line 31 specifies the alignment in memory for the memory space allocated by HPL. On modern machines, one probably wants to use 4, 8 or 16. This may result in a tiny amount of wasted memory. Example: 8 memory alignment in double (> 0)

2.11.2 Guidelines for HPL.dat configuration

1. Figure out a good block size for the matrix multiply routine. The best method is to try a few out. If the block size used by the matrix-matrix multiply routine is known, a small multiple of that block size will do fine. This particular topic is discussed in the FAQs section.

2. The process mapping should not matter if the nodes of the platform are single-processor computers. If the nodes are multi-processors, a row-major mapping is recommended.

3. HPL likes "square" or slightly flat process grids. Unless a very small process grid is used, stay away from the 1-by-Q and P-by-1 process grids. This particular topic is also discussed in the FAQs section.


4. Panel factorization parameters: a good start is the following for lines 14-21:
1 # of panel fact
1 PFACTs (0=left, 1=Crout, 2=Right)
2 # of recursive stopping criterium
4 8 NBMINs (>= 1)
1 # of panels in recursion
2 NDIVs
1 # of recursive panel fact.
2 RFACTs (0=left, 1=Crout, 2=Right)

5. Broadcast parameters: at this time it is far from obvious what the best setting is, so it is worth trying them all. A reasonable starting point for lines 22-23 is:
2 # of broadcast
1 3 BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
The best broadcast depends on the problem size and hardware performance. Usually 4 or 5 may be competitive for machines whose nodes are very fast compared to the network.

6. Look-ahead depth: as mentioned above, 0 or 1 are likely to be the best choices. This also depends on the problem size and machine configuration, so it is worth trying both "no look-ahead (0)" and "look-ahead of depth 1 (1)". That is, for lines 24-25:
2 # of lookahead depth
0 1 DEPTHs (>= 0)

7. Swapping: one can select only one of the three algorithms in the input file. Theoretically, mix (2) should win, however long (1) might just be good enough. The difference between those two should be small, assuming a swapping threshold of the order of the selected block size (NB). If this threshold is very large, HPL will use bin-exch (0) most of the time, and if it is very small (< NB) the long algorithm will always be used. A reasonable choice for lines 26-27 is:
2 SWAP (0=bin-exch,1=long,2=mix)
60 swapping threshold
The long variant is also worth trying. For a very small number of processes in every column of the process grid (say < 4), very little performance difference should be observable.


8. Local storage: Line 28 probably does not matter much; pick 0 if in doubt. Line 29 is more important. It controls how the panel of rows should be stored; no doubt 0 is better. The caveat is that in that case the matrix-multiply function is called with (NoTrans, Trans, ...), that is C := C - A B^T. Unless the computational kernel used has a very poor (with respect to performance) implementation of that case, and is much more efficient with (NoTrans, NoTrans, ...), just pick 0 as well. So, the choice:
0 L1 in (0=transposed,1=no-transposed) form
0 U in (0=transposed,1=no-transposed) form

9. Equilibration: it is hard to tell whether equilibration should always be performed or not. Not knowing much about the random matrix generated, and because the overhead is so small compared to the possible gain, it is turned on all the time:
1 Equilibration (0=no,1=yes)

10. For alignment, 4 should be plenty, but just to be safe, one may want to pick 8 instead:
8 memory alignment in double (> 0)

2.12 HPCC Challenge Benchmark

HPCC was developed to study future Petascale computing systems and is intended to provide a realistic measurement of modern computing workloads. HPCC is made up of seven common computational kernels: STREAM, HPL, DGEMM (matrix multiply), PTRANS (parallel matrix transpose), FFT, RandomAccess, and b_eff (bandwidth/latency tests). The benchmarks attempt to span the space of high and low spatial and temporal locality. The tests are scalable and can be run on a wide range of platforms, from single processors to the largest parallel supercomputers. The HPCC benchmarks test three particular regimes: local or single processor, embarrassingly parallel, and global, where all processors compute and exchange data with each other. STREAM measures a processor's memory bandwidth. HPL is the LINPACK TPP (Toward Peak Performance) benchmark; RandomAccess measures the rate of random updates of memory; PTRANS measures the rate of transfer of very large arrays of data from memory; b_eff measures the latency and bandwidth of increasingly complex communication patterns. All of the benchmarks are run in two modes: base and optimized. The base


run allows no source modifications of any of the benchmarks, but allows generally available optimized libraries to be used. The optimized benchmark allows significant changes to the source code. The optimizations can include alternative programming languages and libraries that are specifically targeted for the platform being tested. The HPC Challenge benchmark consists at this time of 7 benchmarks: HPL, STREAM, RandomAccess, PTRANS, FFTE, DGEMM and b_eff Latency/Bandwidth.

HPL ( system performance ) The Linpack TPP benchmark, which measures the floating point rate of execution for solving a randomly generated dense linear system of equations in double floating-point precision (IEEE 64-bit) arithmetic using MPI. The linear system matrix is stored in a two-dimensional block-cyclic fashion and multiple variants of code are provided for computational kernels and communication patterns. The solution method is LU factorization through Gaussian elimination with partial row pivoting followed by a backward substitution. Unit: Tera Flops per Second

PTRANS (A = A + B^T) ( system performance ) Implements a parallel matrix transpose for two-dimensional block-cyclic storage. It is an important benchmark because it exercises the communications of the computer heavily on a realistic problem where pairs of processors communicate with each other simultaneously. It is a useful test of the total communications capacity of the network. Unit: Giga Bytes per Second

RandomAccess ( system performance ) Global RandomAccess, also called GUPs, measures the rate at which the computer can update pseudo-random locations of its memory - this rate is expressed in billions (giga) of updates per second (GUP/s). Unit: Giga Updates per Second

FFTE ( system performance ) It measures the floating point rate of execution of the double precision complex one-dimensional Discrete Fourier Transform (DFT). Global FFTE performs the same test across the entire system by distributing the input vector in block fashion across all the processes. Unit: Giga Flops per Second

STREAM ( system performance - derived ) The Embarrassingly Parallel STREAM benchmark is a simple synthetic benchmark program that measures sustainable memory bandwidth and the corresponding computation rate for simple numerical vector kernels. It is run in an embarrassingly parallel manner - all computational processes perform the benchmark at the same time, and the arithmetic average rate is multiplied by the number of processes to obtain this value (EP-STREAM Triad * MPI Processes). Unit: Giga Bytes per Second

DGEMM ( per process ) The Embarrassingly Parallel DGEMM benchmark measures the floating-point execution rate of double precision real matrix-matrix multiplication performed by the DGEMM subroutine from the BLAS (Basic Linear Algebra Subprograms). It is run in an embarrassingly parallel manner - all computational processes perform the benchmark at the same time, and the arithmetic average rate is reported. Unit: Giga Flops per Second

Effective bandwidth benchmark (b_eff) The effective bandwidth benchmark is a set of tests that measure the latency and bandwidth of a number of simultaneous communication patterns.

Random Ring Bandwidth ( per process ) Randomly Ordered Ring Bandwidth reports the bandwidth achieved in the ring communication pattern. The communicating processes are ordered randomly in the ring (with respect to the natural ordering of the MPI default communicator). The result is averaged over various random assignments of processes in the ring. Unit: Giga Bytes per second

Random Ring Latency ( per process ) Randomly Ordered Ring Latency reports the latency in the ring communication pattern. The communicating processes are ordered randomly in the ring (with respect to the natural ordering of the MPI default communicator). The result is averaged over various random assignments of processes in the ring. Unit: micro-seconds

Giga-updates per second (GUPS) is a measure of computer performance. GUPS is a measurement of how frequently a computer can issue updates to randomly generated RAM locations. GUPS measurements stress the latency and especially bandwidth capabilities of a machine.

Chapter 3

Design and Implementation

3.1 Beowulf Clusters: A Low cost alternative

Beowulf is not a particular product. It is a concept for clustering varying numbers of small, relatively inexpensive computers running the Linux operating system. The goal of Beowulf clustering is to create a parallel-processing supercomputer environment at a price well below that of conventional supercomputers.

Figure 3.1: The Schematic structure of proposed cluster

A Beowulf cluster is a PC cluster that normally runs under the Linux OS. Each PC (node) is dedicated to the work of the cluster and connected through a network with the other nodes. Figure 3.1 schematically shows the structure of the proposed cluster. In this cluster, a master node controls the other worker nodes by communicating through the network using the Message Passing Interface (MPI).


The proposed cluster will have a better price/performance ratio and scalability than other parallel computers due to the use of off-the-shelf components and the Linux OS. It is easy and economical to add more nodes as needed without changing software programs.

3.2 Logical View of proposed Cluster

The primary and most often used view is termed the logical view, and this is the view that users generally interact with when using a cluster. In this view, the physical components are categorized and displayed in a layered manner; that is, the primary concerns here are the parallel applications, the message passing library, the OS and the interconnect.

Figure 3.2: Logical view of proposed cluster

3.3 Hardware Configuration

As indicated previously, a cluster comprises computers interconnected through a LAN. Let us talk first about the requirements of this cluster in terms of hardware and then about the software that will run on the system.

3.3.1 Master Node

The master server provides access to the primary network and ensures availability of the cluster. The server has a Fast Ethernet connection to the network in order to better keep up with the high speed of the PCs. Any system from Intel, AMD or

any other vendor can be used as the server. Here, a PC with an Intel i7-2600 processor and 4 GB RAM is used as the server.

3.3.2 Compute Nodes

Building custom PCs from commodity off-the-shelf components requires a lot of work to assemble the cluster, but the cluster can then be fine-tuned as needed. One can buy generic PCs and shelves, and may want keyboard switches for smaller configurations. For larger configurations, a better solution is to use the serial ports of each machine and connect them to a terminal server. Custom rack-mount nodes can even be used; they are more expensive but save space, and may complicate cooling due to closely packed components. For this setup, old unused PCs from the college are used. Here, for testing purposes, PCs similar to the master, with Intel i7-2600 processors and 4 GB RAM, are used.

3.3.3 Network

As indicated previously, the computers in a cluster communicate using a network interconnection, as can be seen in Figure 3.3. The master and the compute nodes have NICs, and all the computers are connected to a switch which performs the delivery of messages. The cost per port of an Ethernet switch is about four times larger than that of an Ethernet hub, but an Ethernet switch is used for the following reasons: an Ethernet hub is a network device that acts as a broadcast bus, where an input signal is amplified and distributed to all ports. However, only a couple of computers can communicate properly at once, and if two or more computers send packets simultaneously a collision occurs. Therefore, the bandwidth of an Ethernet hub is equivalent to the bandwidth of the communication link: 10 Mb/s for standard Ethernet, 100 Mb/s for Fast Ethernet and 1 Gb/s for Gigabit Ethernet. An Ethernet switch provides more aggregate bandwidth by allowing multiple simultaneous communications; if there are no conflicts in the output ports, the switch can send multiple packets simultaneously. A major disadvantage that clusters have compared to supercomputers is their latency. The bandwidth of each computer could be increased using multiple NICs, which is possible in Linux through what is known as channel bonding: a simulated network interface links multiple NICs so that applications see only a single interface. The cluster is often accessed remotely, which is why the frontend has two NICs, one to access the Internet and another to connect to the other nodes in the cluster. The maximum bandwidth

provided by the college's Fast Ethernet is 100 Mb/s, and the minimum latency for Fast Ethernet is about 80 microseconds. All cluster machines are connected through the college's Ethernet.

Figure 3.3: The Network interconnection

3.4 Softwares

The system that has been designed and implemented uses the Linux kernel with GNU applications. These applications range from servers to compilers.

1. Operating System: The operating system used is the Linux-based CentOS 6.2. It is an enterprise-quality operating system, because it is based on the source code of Red Hat Enterprise Linux, which is tested and stabilized extensively prior to release. At the same time, CentOS (Community ENTerprise Operating System) is completely free and open source, offering the user support and features of a community-run Linux distribution. Version 6 has been chosen because it is the latest stable version. The operating system that runs on the frontend includes the standard applications of the distribution in addition to others required for the construction of the cluster. The specific applications included for the construction of the cluster are message-passing libraries, compilers, servers and software for monitoring the resources of the cluster.

2. Message-passing libraries: In parallel computation, in order to perform task resolution and intensive calculations, one must divide and distribute independent tasks to the different computers using message-passing libraries. There are several libraries of this type, the most well-known being MPI and PVM (Parallel Virtual Machine). The system integrates MPI.


The reason for this choice is that it is the library most commonly used by the numerical analysis community for message passing. Specifically, MPICH2 has been used in the proposed system.

3. Compilers: Languages commonly used in parallel computing are C, C++, Python and FORTRAN. For this reason these programming languages are supported within the system that has been developed, by integrating the compilers gcc, g++ and gfortran.

4. Compute nodes: The operating system that runs on the nodes is basic CentOS 6.2 without a GUI. It integrates the kernel and the basic services necessary for adequate performance of the nodes. Software not needed for this purpose has been discarded. MPICH2 is included, as well as the compilers gcc, g++ and gfortran.

3.4.1 MPICH2

MPICH2 is architected so that a number of communication infrastructures can be used. These are called "devices." The device that is most relevant for the Beowulf environment is the channel device (also called "ch3" because it is the third version of the channel approach for implementing MPICH); this supports a variety of communication methods and can be built to support the use of both TCP over sockets and shared memory. In addition, MPICH2 uses a portable interface to process management systems, providing access both to external process managers (allowing the process managers direct control over starting and running the MPI processes) and to the MPD scalable process manager that is included with MPICH2. To run a first MPI program, carry out the following installation steps:

1. Download mpich2-1.4.1p1.tar.gz from www.mcs.anl.gov/mpi/mpich and copy it to /home/beowulf/sw/

2. Extract the contents in /home/beowulf/sw/ $tar xvfz mpich2-1.4.1p1.tar.gz

3. Create folder for installation $mkdir /opt/mpich2-1.4.1p1

4. Create a build directory $mkdir /tmp/mpich2-1.4.1p1 $cd /tmp/mpich2-1.4.1p1


5. Run configure <configure options>, saving the output to configure.log. Most users should specify a prefix for the installation path when configuring: $/home/beowulf/sw/mpich2-1.4.1p1/configure --prefix=/opt/mpich2-1.4.1p1 2>&1 | tee configure.log

6. By default, this creates the channel device for communication with TCP over sockets. Now build: $make 2>&1 | tee make.log

7. Install the MPICH2 commands: $make install 2>&1 | tee install.log

8. Add the '<prefix>/bin' directory to the PATH by adding the line below to the $HOME/.bashrc file in the home directory: $vi $HOME/.bashrc export PATH=<prefix>/bin:$PATH

9. Test mpich2 installation $which mpicc

SSH login without password

Public key authentication allows logging in to a remote host via the SSH protocol without a password and is more secure than password-based authentication. Try creating a passwordless connection from the master to node1 using public-key authentication.

Create key

Press ENTER at every prompt.
[root@master]# ssh-keygen
Generating public/private rsa key pair.
Enter file in which to save the key (/home/user/.ssh/id_rsa):
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/user/.ssh/id_rsa.
Your public key has been saved in /home/user/.ssh/id_rsa.pub.
The key fingerprint is:
b2:ad:a0:80:85:ad:6c:16:bd:1c:e7:63:4f:a0:00:15 user@host
The key's randomart image is:


[root@master]#

For added security, the key itself should be protected with a strong passphrase. If a passphrase is used to protect the key, ssh-agent can be used to cache the passphrase.

Copy key to remote host

[root@master]# ssh-copy-id root@node1
root@node1's password:
Now try logging into the machine, with "ssh 'root@node1'", and check in: .ssh/authorized_keys to make sure we haven't added extra keys that you weren't expecting.
[root@master]#

Login to remote host

Note that no password is required.
[root@master]# ssh root@node1
Last login: Tue May 18 12:47:53 2012 from 10.1.11.210
[root@node1]#

It is also necessary to disable the firewall on all cluster machines so that the cluster can work seamlessly. To achieve this, first log in as the root user, then enter the following three commands:
#service iptables save
#service iptables stop
#chkconfig iptables off
Now MPI programs can be run on the cluster.

Running MPI Program

The following assumes that MPICH2 is installed on all cluster machines running CentOS 6.2 and that every machine has access to every other via the mpiexec command. It is also assumed that the command line (a terminal window) is used to compile, copy, and run the code. Typically, running an MPI program consists of three steps:


Compile

Assuming that the code to compile is ready (if only binary executables are available, proceed to step 2, copy), an executable needs to be created. This involves compiling the code with the appropriate compiler, linked against the MPI libraries. It is possible to pass all options through a standard cc or f77 command, but MPICH provides "wrappers" (mpicc for cc/gcc, mpicxx/mpic++ for c++/g++ on UNIX/Linux and mpif77 for f77) that appropriately link against the MPI libraries and set the appropriate include and library paths.

Example: hello.c (use the vi text editor to create the file hello.c)

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size, length;
    char name[80];
    MPI_Init(&argc, &argv);   /* note that argc and argv are passed by address */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(name, &length);
    printf("Hello MPI: processor %d of %d on %s\n", rank, size, name);
    MPI_Finalize();
    return 0;
}

After saving the above example file, compile the program using the mpicc command.
$mpicc -o hello hello.c
The "-o" option provides an output file name, otherwise the executable would be saved as "a.out". Be careful to provide an executable name when the "-o" option is used; many programmers have deleted part of their source code by accidentally giving their source file as the output file name. If the file name is typed correctly and there are no bugs in the code, it will compile successfully, and an "ls" command should show that the output file "hello" has been created.
$ls
hello.c
$mpicc -o hello hello.c
$ls
hello hello.c


Copy

In order for the program to run on each node, the executable must exist on each node. There are as many ways to make sure that the executable exists on all of the nodes as there are ways to put the cluster together in the first place. One method is covered below. This method assumes that there exists a directory (/home/beowulf/testing) on all the nodes, that authentication is being done via ssh, and that public keys have been shared for the account to allow login and remote execution without a password. One command that can be used to copy files between machines is "scp"; scp is a Unix command that securely copies files between remote machines and in its simplest use acts as a secure remote copy. It takes arguments similar to the Unix "cp" command. With the example saved in the directory /home/beowulf/testing (i.e. the file is saved as /home/beowulf/testing/hello), the following command will copy the file hello to a remote node:
$scp hello root@node1:/home/beowulf/testing
This needs to be done for each host. To check whether the copy worked properly, ssh into each host and verify that the files are there using the "ls" command.

Execute

Once the code has been compiled and copied to all of the nodes, run it using the mpiexec command. Two of the more common arguments to the mpiexec command are the "-np" (or "-n") argument, which specifies how many processes to use, and the "-f" argument, which specifies exactly which nodes are available for use. An entry for the hosts file has already been made in .bashrc in the home directory, so there is no need to use this argument. Change directory to where the executable is located, and run the hello command using 4 processes:
$mpiexec -n 4 ./hello
Hello MPI: processor 0 of 4 on master
Hello MPI: processor 3 of 4 on node3
Hello MPI: processor 2 of 4 on node2
Hello MPI: processor 1 of 4 on node1


3.4.2 HYDRA: Process Manager

Hydra is a process management system for starting parallel jobs. Hydra is designed to work natively with multiple daemons such as ssh, rsh, pbs, slurm and sge. Starting with MPICH2-1.3, Hydra is the default process manager and is automatically used with mpiexec. As there is a bug in the hydra-1.4 that comes with mpich2-1.4.1p1, hydra-1.5b1 has been installed separately. Once built, the new Hydra executables are in mpich2/bin, or in the bin subdirectory of the install directory if an install has been done. Put this (bin) directory in the PATH in .bashrc for convenience:
export PATH=/opt/mpich2-1.4.1p1/bin/bin:$PATH
HYDRA_HOST_FILE: this variable points to the default host file to use when the "-f" option is not provided to mpiexec. For bash:
export HYDRA_HOST_FILE=<path to host file>/hosts

3.4.3 TORQUE: Resource Manager

The TORQUE Resource Manager is a distributed resource manager providing control over batch jobs and distributed compute nodes. Its name stands for Terascale Open-Source Resource and QUEue Manager. It is a community effort based on the original PBS project and, with more than 1,200 patches, has incorporated significant advances in the areas of scalability, fault tolerance, and feature extensions contributed by NCSA, OSC, USC, the US DOE, Sandia, PNNL, UB, TeraGrid, and many other leading-edge HPC organizations. TORQUE can integrate with the non-commercial Maui Cluster Scheduler or the commercial Moab Workload Manager to improve overall utilization, scheduling and administration on a cluster. TORQUE is described by its developers as open-source software, using the OpenPBS version 2.3 license, while Debian classifies it as non-free under the Debian Free Software Guidelines owing to issues with the license.

Feature Set

TORQUE provides enhancements over standard OpenPBS in the following areas:


Fault Tolerance

• Additional failure conditions checked/handled

• Node health check script support

Scheduling Interface

• Extended query interface providing the scheduler with additional and more accurate information

• Extended control interface allowing the scheduler increased control over job behavior and attributes

• Allows the collection of statistics for completed jobs

Scalability

• Significantly improved server-to-MOM communication model

• Ability to handle larger clusters (over 15 TF/2,500 processors)

• Ability to handle larger jobs (over 2000 processors)

• Ability to support larger server messages

Usability

• Extensive logging additions

• More human readable logging (i.e. no more ’error 15038 on command 42’)

3.4.4 MAUI: Cluster Scheduler

Maui Cluster Scheduler is an open-source job scheduler for use on clusters and supercomputers, initially developed by Cluster Resources, Inc. Maui is capable of supporting multiple scheduling policies, dynamic priorities, reservations, and fairshare capabilities. Maui satisfies some definitions of open-source software but is not available for commercial usage. It improves the manageability and efficiency of machines ranging from clusters of a few processors to multi-teraflops supercomputers.


Job State

Jobs in Maui can be in one of three major states:

Running

A job that has been allotted its required resources and has started its computation is considered running until it finishes.

Queued (idle)

Jobs that are eligible to run. Priority is calculated here and the jobs are sorted according to the calculated priority. Advance reservations are made starting with the job at the front of the queue.

Non-queued

Jobs that, for some reason, are not allowed to start. Jobs in this state do not gain any queue-time priority. There is a limit on the number of jobs a group/user can have in the Queued state. This prevents users from acquiring more queue-time priority than deserved by submitting a large number of jobs.

3.5 System Considerations

The following sections discuss system considerations and requirements:

Design/Development Debug
There are a number of critical tools necessary for the implementation of a successful HPCC cluster solution. The first is a compiler which can take advantage of the architectural features of the processor. Next, a debugger such as gdb allows the developer to debug the code and assists in finding the problem areas or sections of code to be further tuned for performance. A profiler is also necessary to assist in finding the performance bottlenecks in the overall system, including the system interconnect.

Job Control
Once an application has been developed or ported to a Beowulf cluster, the application must be started and run on a portion of or the entire cluster. One must understand

the particular needs and requirements for system partitioning, how jobs are started and run, and how a queue of jobs can be set up to run automatically.

Checkpoint Restart
Many applications running on even very large HPC clusters will require many hours, days, or weeks of execution time to run to completion. A failure in one part of the system could corrupt a job execution run, forcing a restart. The solution is to periodically checkpoint the current state, writing the intermediate data calculations available at the end of the interval to a disk subsystem. This usually takes a small amount of time, with the compute functions temporarily paused; the time depends on the storage architecture. If there is a failure of one of the computing components, the failing component can be taken out of the cluster and the job restarted from the data available from the previous period's checkpoint save.

Performance Monitoring
Even if a considerable amount of time is spent during the debug phase to tune the application for best performance, a performance monitoring function is still necessary to watch cluster performance over time. With potentially multiple job streams running concurrently on the system, each taking differing amounts of CPU or memory, there may be situations where the applications are not running at the expected efficiency. The performance-monitoring tool can assist in detecting these situations.

Benchmarking
An excellent collection of benchmarks is the HPCC Benchmarking Suite. It consists of seven well-known public domain benchmarks. The latest version allows one to compare network performance with raw TCP, PVM, MPICH and LAM/MPI, among others. It is also worthwhile to use the latest version of the HPL (High Performance Linpack) benchmark. For parallel benchmarks, the above-mentioned tests are a reasonable choice (especially if numerical computations are run on the cluster). These and other benchmarks are necessary to evaluate different architectures, motherboards and network cards.

Chapter 4

Experiments

To evaluate the usage, acceptability and performance of the cluster, a few parallel programs are implemented. The first one finds the prime numbers in a given range. The second calculates the value of π. Then an embarrassingly parallel program to solve the circuit satisfiability problem is tested. The 1D time dependent heat equation and the Radix-2 FFT algorithm are implemented as real-life programs. Two standard benchmarking experiments, which are also used to rank the Top500 supercomputers, are conducted as well: the first is the High Performance Linpack benchmark and the other is HPCC, a complete suite of seven tests covering many performance factors. The work of a global problem can be divided into a number of independent tasks which rarely need to synchronize; Monte Carlo simulations and numerical integration are examples of this. So in the examples below, the code that can be parallelized is identified and then executed simultaneously on different cluster nodes with different data. If the parallelizable code does not depend on the output of other nodes, better performance is obtained. The essence is to divide the entire computation evenly among the collaborating processors: divide and conquer.

4.1 Finding Prime Numbers

This C program counts the number of primes between 1 and N, using MPI to carry out the calculation in parallel. The algorithm is completely naive: for each integer I, it simply checks whether any smaller J evenly divides it. The total amount of work for a given N is thus roughly proportional to N^2/2. Figure 4.1 shows the performance of the cluster for finding various primes as compared to a single machine. This program is mainly a starting point for investigations into parallelization.


Figure 4.1: Graph showing performance for Finding Primes

Here the total range of numbers in which primes are to be found is divided into equal parts and distributed among the compute nodes. Every node carries out its task and sends its result back to the master node. Finally, it is the job of the master node to combine the results of all the nodes and give the final count. A small code sketch of this decomposition is given below.
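The sketch below is a minimal illustration of the structure just described, not the exact program used in the experiments; the block distribution, the naive is_prime test and the problem size N are assumptions made for the sketch.

#include <stdio.h>
#include <mpi.h>

/* naive primality test by trial division, as described above */
static int is_prime(long n) {
    long j;
    if (n < 2) return 0;
    for (j = 2; j < n; j++)
        if (n % j == 0) return 0;
    return 1;
}

int main(int argc, char **argv) {
    long N = 10000, local = 0, total = 0, lo, hi, i;
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    /* divide the range 1..N into equal blocks, one per node */
    lo = (long)rank * N / size + 1;
    hi = (long)(rank + 1) * N / size;
    for (i = lo; i <= hi; i++)
        local += is_prime(i);
    /* the master (rank 0) combines the partial counts */
    MPI_Reduce(&local, &total, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("Primes between 1 and %ld: %ld\n", N, total);
    MPI_Finalize();
    return 0;
}

The sketch compiles with mpicc and runs with mpiexec exactly as shown for hello.c above.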

4.2 PI Calculation

The number π is a mathematical constant that is the ratio of a circle's circumference to its diameter. The constant, sometimes written pi, is approximately equal to 3.14159. The program calculates the value of π using:

$\int_{0}^{1} \frac{4}{1+x^{2}}\,dx = \pi$

The calculated value of π is then compared with the true value to determine the accuracy of the output, and the time taken by the program is also displayed. Figure 4.2 shows the time taken by different numbers of PCs to calculate π. To parallelize the code, the part(s) of the sequential algorithm that can be executed in parallel must be identified (this is the difficult part), and the global work and data are then distributed among the cluster nodes. Here, the different iterations over the N rectangles of the numerical integration can be run in parallel, as sketched below.
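A minimal sketch of this decomposition follows; it assumes the midpoint rule with N rectangles and a simple cyclic distribution of rectangles over processes, and is an illustration rather than the exact program run on the cluster.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    long N = 1000000, i;                 /* number of rectangles */
    double h, x, local = 0.0, pi = 0.0;
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    h = 1.0 / (double)N;
    /* each process sums every size-th rectangle of 4/(1+x^2) */
    for (i = rank; i < N; i += size) {
        x = h * ((double)i + 0.5);       /* midpoint of rectangle i */
        local += 4.0 / (1.0 + x * x);
    }
    local *= h;
    MPI_Reduce(&local, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("pi = %.15f, error = %.2e\n", pi, pi - 3.141592653589793);
    MPI_Finalize();
    return 0;
}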


Figure 4.2: Graph showing performance for Calculating π

4.3 Circuit Satisfiability Problem

CSAT is a C program which demonstrates, for a particular circuit, an exhaustive search for solutions of the circuit satisfiability problem. This version of the program uses MPI to carry out the solution in parallel. The problem assumes that a logical circuit of AND, OR and NOT gates is given, with N binary inputs and a single output; the task is to determine all inputs which produce a 1 as the output. The general problem is NP-complete, so there is no known polynomial-time algorithm to solve the general case. The natural way to search for solutions is therefore exhaustive search. In an interesting way, this is a very extreme and discrete version of the problem of maximizing a scalar function of multiple variables; the difference is that here both the inputs and the output only take the values 0 and 1, rather than a continuous range of real values. This problem is a natural candidate for parallel computation, since the individual evaluations of the circuit are completely independent. The complete problem domain is therefore divided into equal parts and each node performs its share of the work to produce the final result, as sketched below.
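The partitioning of the 2^N input assignments can be sketched as follows. The circuit() function here is a hypothetical placeholder invented for the illustration; only the block decomposition and the reduction pattern are meant to reflect the description above.

#include <stdio.h>
#include <mpi.h>

#define N_INPUTS 20    /* number of binary inputs; illustrative value */

/* placeholder circuit; the real program evaluates the particular
   AND/OR/NOT circuit under study */
static int circuit(unsigned long bits) {
    int x0 = (int)(bits & 1);
    int x1 = (int)((bits >> 1) & 1);
    int x2 = (int)((bits >> 2) & 1);
    return (x0 || x1) && (!x1 || x2);
}

int main(int argc, char **argv) {
    unsigned long total = 1UL << N_INPUTS, lo, hi, b;
    long local = 0, found = 0;
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    /* divide the 2^N assignments into equal blocks, one per process */
    lo = total * (unsigned long)rank / (unsigned long)size;
    hi = total * (unsigned long)(rank + 1) / (unsigned long)size;
    for (b = lo; b < hi; b++)
        local += circuit(b);
    MPI_Reduce(&local, &found, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("Satisfying assignments: %ld of %lu tested\n", found, total);
    MPI_Finalize();
    return 0;
}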


Figure 4.3: Graph showing performance for solving C-SAT Problem

4.4 1D Time Dependent Heat Equation

The heat equation is an important partial differential equation which describes the distribution of heat (or variation in temperature) in a given region over time. This program solves

$\frac{\partial u}{\partial t} - k\,\frac{\partial^{2} u}{\partial x^{2}} = f(x,t)$
over the interval [A,B] with boundary conditions

$u(A,t) = u_{A}(t), \qquad u(B,t) = u_{B}(t),$
over the time interval [t0, t1] with initial condition

$u(x,t_{0}) = u_{0}(x)$

4.4.1 The finite difference discretization

To apply the finite difference method, define a grid of points x(1) through x(n), and a grid of times t(1) through t(m). In the simplest case, both grids are evenly spaced. The approximate solution at spatial point x(i) and time t(j) is denoted by u(i,j).


Figure 4.4: Graph showing performance for solving 1D Time Dependent Heat Equation

A second-order finite difference is used to approximate the second derivative in space, using the solution at three points equally separated in space. A forward Euler approximation to the first derivative in time is used, which relates the value of the solution to its value a short interval in the future. Thus, at the spatial point x(i) and time t(j), the discretized differential equation defines a relationship between u(i-1,j), u(i,j), u(i+1,j) and the "future" value u(i,j+1). This relationship can be drawn symbolically as a four node stencil:

Figure 4.5: Symbolic relation between four nodes

Since the value of the solution at the initial time is given, the stencil, plus the boundary condition information, can be used to advance the solution to the next time step. Repeating this operation gives an approximation to the solution at every point in the space-time grid.
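Written out, the four-node stencil corresponds to the standard explicit update (forward Euler in time, centred differences in space); it is reproduced here for clarity, assuming uniform spacings $\Delta x$ and $\Delta t$:

$u(i,j+1) = u(i,j) + \Delta t\left[\,k\,\frac{u(i-1,j) - 2u(i,j) + u(i+1,j)}{\Delta x^{2}} + f\big(x(i),t(j)\big)\right]$

For this explicit scheme the usual stability restriction $k\,\Delta t/\Delta x^{2} \le 1/2$ applies.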


4.4.2 Using MPI to compute the solution

To solve the 1D heat equation using MPI, a form of domain decomposition is used. Given P processors, the interval [A,B] is divided into P equal subintervals. Each processor can set up the stencil equations that define the solution almost independently. The exception is that every processor needs to receive a copy of the solution values determined for the nodes immediately to its left and right. Thus, each processor uses MPI to send its leftmost solution value to its left neighbour and its rightmost solution value to its right neighbour. Of course, each processor must then also receive the corresponding information that its neighbours send to it. (The first and last processors only have one neighbour, and use boundary condition information to determine the behaviour of the solution at the node which is not next to another processor's node.) The naive way of setting up the information exchange works, but can be inefficient, since each processor sends a message and then waits for confirmation of receipt, which cannot happen until some processor has moved to the "receive" stage, which only happens because the first or last processor does not have to receive information on a given step.
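The decomposition and neighbour exchange can be sketched as below. This is an illustrative sketch only, not the exact program used in the experiments: MPI_Sendrecv is used here to avoid the serialization issue just mentioned, a simple test problem (k = 1, zero boundary values, u(x,0) = x(1-x)) is assumed, and the number of processes is assumed to divide the grid size evenly.

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv) {
    const int n_global = 100;                 /* total interior grid points */
    const double k = 1.0, dx = 1.0 / (n_global + 1);
    const double dt = 0.4 * dx * dx / k;      /* within the stability limit */
    double *u, *unew, x;
    int rank, size, n, first, left, right, i, step;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    n = n_global / size;                       /* points owned by this process   */
    first = rank * n + 1;                      /* global index of its first point */
    u = malloc((n + 2) * sizeof(double));      /* two extra ghost cells           */
    unew = malloc((n + 2) * sizeof(double));
    for (i = 0; i <= n + 1; i++) {
        x = (first + i - 1) * dx;
        u[i] = x * (1.0 - x);                  /* initial condition */
    }
    left  = (rank == 0)        ? MPI_PROC_NULL : rank - 1;
    right = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;

    for (step = 0; step < 1000; step++) {
        /* exchange boundary values with the two neighbours */
        MPI_Sendrecv(&u[1], 1, MPI_DOUBLE, left, 0,
                     &u[n + 1], 1, MPI_DOUBLE, right, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Sendrecv(&u[n], 1, MPI_DOUBLE, right, 1,
                     &u[0], 1, MPI_DOUBLE, left, 1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        if (rank == 0)        u[0] = 0.0;      /* physical boundary values */
        if (rank == size - 1) u[n + 1] = 0.0;
        /* explicit update from the four-node stencil */
        for (i = 1; i <= n; i++)
            unew[i] = u[i] + k * dt / (dx * dx) * (u[i-1] - 2.0 * u[i] + u[i+1]);
        for (i = 1; i <= n; i++)
            u[i] = unew[i];
    }
    if (rank == 0)
        printf("first interior value after 1000 steps: %f\n", u[1]);
    free(u);
    free(unew);
    MPI_Finalize();
    return 0;
}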

4.5 Fast Fourier Transform

To make the DFT operation more practical, several FFT algorithms have been proposed. The fundamental approach for all of them is to make use of the properties of the DFT operation itself. All of them reduce the computational cost of performing the DFT on the given input sequence.

$W_{N}^{kn} = e^{-j2\pi kn/N}$

This value $W_{N}^{kn}$ is referred to as the twiddle factor or phase factor. The twiddle factor, being a trigonometric function over discrete points around the four quadrants of the two-dimensional plane, has some symmetry and periodicity properties.

Symmetry property: $W_{N}^{k+N/2} = -W_{N}^{k}$

Periodicity property: $W_{N}^{k+N} = W_{N}^{k}$


Figure 4.6: Graph showing performance of the Radix-2 FFT algorithm

Using these properties of the twiddle factor, unnecessary computations can be eliminated. Another approach that can be used is divide-and-conquer. In this approach, the given one-dimensional input sequence of length N can be represented in a two-dimensional form with M rows and L columns, with N = M x L. It can be shown that a DFT performed on such a representation leads to fewer computations: N(M+L+1) complex multiplications and N(M+L-2) complex additions. Note that this approach is applicable only when the value of N is composite.
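As a concrete illustration of the saving (the numbers below are an example, not taken from the experiments), take N = 1000 with M = 50 and L = 20:

$N(M+L+1) = 1000 \times 71 = 71{,}000 \ \text{complex multiplications}$

compared with roughly $N^{2} = 10^{6}$ for the direct DFT, i.e. about a fourteen-fold reduction even before the symmetry and periodicity of the twiddle factors are exploited.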

4.5.1 Radix-2 FFT algorithm

This algorithm is a special case of the approaches described earlier in which N can be represented as a power of 2, i.e., N = 2^v. This means that the number of complex multiplications and additions is reduced to N(N+6)/2 and N^2/2 respectively just by using the divide-and-conquer approach. When the symmetry and periodicity properties of the twiddle factor are also used, it can be shown that the number of complex additions and multiplications can be reduced to N log2 N and (N/2) log2 N respectively. Hence, from an O(N^2) algorithm, the computational complexity has been reduced to O(N log N). The entire process is divided into log2 N stages and in each stage N/2 two-point DFTs are performed. The computation involving each pair of data is called a butterfly. The Radix-2 algorithm can be implemented as a decimation-in-time (M=N/2 and L=2) or a decimation-in-frequency (M=2 and L=N/2) algorithm.
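For illustration, the decimation-in-time form of the radix-2 recursion and its butterfly can be written compactly as the serial, textbook-style C sketch below (this is not the parallel code benchmarked on the cluster; the input length is assumed to be a power of two, and the test signal in main is arbitrary). It can be compiled with gcc and linked with -lm.

#include <stdio.h>
#include <math.h>
#include <complex.h>

#define PI 3.14159265358979323846

/* Recursive radix-2 decimation-in-time FFT: x is read with the given
   stride, y receives the n transformed values; n must be a power of 2. */
static void fft(double complex *x, double complex *y, int n, int stride) {
    int k;
    double complex w, e, o;
    if (n == 1) { y[0] = x[0]; return; }
    fft(x, y, n / 2, 2 * stride);                   /* even-indexed samples */
    fft(x + stride, y + n / 2, n / 2, 2 * stride);  /* odd-indexed samples  */
    for (k = 0; k < n / 2; k++) {
        w = cexp(-2.0 * I * PI * k / n);            /* twiddle factor W_N^k */
        e = y[k];
        o = w * y[k + n / 2];
        y[k]         = e + o;                       /* butterfly */
        y[k + n / 2] = e - o;
    }
}

int main(void) {
    enum { N = 8 };
    double complex x[N], y[N];
    int i;
    for (i = 0; i < N; i++)
        x[i] = cos(2.0 * PI * i / N);    /* one cycle of a cosine */
    fft(x, y, N, 1);
    for (i = 0; i < N; i++)
        printf("X[%d] = %6.3f %+6.3fi\n", i, creal(y[i]), cimag(y[i]));
    return 0;
}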


Figure 4.7: 8-point Radix-2 FFT: Decimation in frequency form

Figure 4.7 gives the decimation-in-frequency form of the Radix-2 algorithm for an input sequence of length N=8.

4.6 Theoretical Peak Performance

The theoretical peak is based not on an actual performance from a benchmark run, but on a paper computation to determine the theoretical peak rate of execution of floating point operations for the machine. This is the number manufacturers often cite; it represents an upper bound on performance. That is, the manufacturer guarantees that programs will not exceed this rate on a given computer. To calculate the theoretical peak performance of the HPC system, first the theoretical peak performance of one node (server) is calculated in GFlops, and then the node performance is multiplied by the number of nodes in the HPC system. The following formula is used for node theoretical peak performance:

Node performance in GFlops = (CPU speed in GHz) x (number of CPU cores) x (CPU instructions per cycle) x (number of CPUs per node)

For the cluster, with CPUs based on the Intel i7-2600 (3.40 GHz, 4 cores):
CPU speed in GHz: 3.40
No. of cores per CPU: 4


No. of instructions per cycle: 4
Node performance: 3.40 x 4 x 4 = 54.4 GFlops

Four-PC cluster theoretical peak performance: 54.4 GFlops x 4 = 217.6 GFlops

4.7 Benchmarking

It is generally a good idea to verify that the newly built cluster can actually do work. This can be accomplished by running a few industry-accepted benchmarks. The purpose of benchmarking is not simply to obtain the best possible numbers, but to obtain consistent, repeatable and accurate results that also represent the best the system can achieve.

4.8 HPL

HPL (High Performance Linpack) is a software package that solves a (random) dense linear system of equations in double precision (64-bit) arithmetic on distributed-memory computers. The performance measured using this program on several computers forms the basis for the Top 500 supercomputer list. Using ATLAS (Automatically Tuned Linear Algebra Software) for the BLAS library, it gives 28.67 GFlops for the 4-node cluster.

4.8.1 HPL Tuning

After having built the executable /root/hpl-2.0/bin/Linux_PII_CBLAS/xhpl, one may want to modify the input data file HPL.dat. This file should reside in the same directory as the executable. An example HPL.dat file is provided by default. This file contains information about the problem sizes, machine configuration, and algorithm features to be used by the executable. It is 31 lines long. All the selected parameters are printed in the output generated by the executable. There are many ways to tackle tuning, for example:

1. Fixed Processor Grid, Fixed Block size and Varying Problem size N.

2. Fixed Processor Grid, Fixed Problem size and Varying Block size.

3. Fixed Problem size, Fixed Block size and Varying the Processor grid.

4. Fixed Problem size, Varying the Block size and Varying the Processor grid.


5. Fixed Block size, Varying the Problem size and Varying the Processor grid.

HPL.dat file for cluster

HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out output file name (if any)
8 device out (6=stdout,7=stderr,file)
1 # of problems sizes (N)
41328 Ns
1 # of NBs
168 NBs
0 PMAP process mapping (0=Row-,1=Column-major)
1 # of process grids (P x Q)
4 Ps
4 Qs
16.0 threshold
1 # of panel fact
2 PFACTs (0=left, 1=Crout, 2=Right)
1 # of recursive stopping criterium
4 NBMINs (>= 1)
1 # of panels in recursion
2 NDIVs
1 # of recursive panel fact.
1 RFACTs (0=left, 1=Crout, 2=Right)
1 # of broadcast
1 BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1 # of lookahead depth
1 DEPTHs (>= 0)
2 SWAP (0=bin-exch,1=long,2=mix)
64 swapping threshold
0 L1 in (0=transposed,1=no-transposed) form
0 U in (0=transposed,1=no-transposed) form
1 Equilibration (0=no,1=yes)
8 memory alignment in double (> 0)
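The problem size above is consistent with the common HPL rule of thumb (from the HPL FAQ) of sizing the N x N matrix of 8-byte doubles to occupy roughly 80% of the total memory; this is included here only as a cross-check, not as part of the original configuration. For 4 nodes with 4 GB each:

$N \approx \sqrt{\frac{0.8 \times 16 \times 10^{9}}{8}} \approx 40000$

which is close to the value Ns = 41328 used above.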


4.8.2 Run HPL on cluster

At this point all that remains is to add some software that can run on the cluster, and there is nothing better than HPL (Linpack), which is widely used to measure cluster efficiency (the ratio between theoretical and actual performance). Do the following steps on all nodes. Copy the Make.Linux_PII_CBLAS file from $(HOME)/hpl-2.0/setup/ to $(HOME)/hpl-2.0/ and edit Make.Linux_PII_CBLAS:
# ----------------------------------------------------------------------
# - HPL Directory Structure / HPL library ------------------------------
# ----------------------------------------------------------------------
#
TOPdir = $(HOME)/hpl-2.0
INCdir = $(TOPdir)/include
BINdir = $(TOPdir)/bin/$(ARCH)
LIBdir = $(TOPdir)/lib/$(ARCH)
#
HPLlib = $(LIBdir)/libhpl.a
#
# ----------------------------------------------------------------------
# - Message Passing library (MPI) --------------------------------------
# ----------------------------------------------------------------------
# MPinc tells the C compiler where to find the Message Passing library
# header files, MPlib is defined to be the name of the library to be
# used. The variable MPdir is only used for defining MPinc and MPlib.
#
#MPdir = /usr/lib64/mpich2
#MPinc = -I$(MPdir)/include
#MPlib = $(MPdir)/lib/libmpich.a
#
# ----------------------------------------------------------------------
# - Linear Algebra library (BLAS or VSIPL) -----------------------------
# ----------------------------------------------------------------------
# LAinc tells the C compiler where to find the Linear Algebra library
# header files, LAlib is defined to be the name of the library to be
# used. The variable LAdir is only used for defining LAinc and LAlib.
#


LAdir = /usr/lib/atlas
LAinc =
LAlib = $(LAdir)/libcblas.a $(LAdir)/libatlas.a
# ----------------------------------------------------------------------
# - Compilers / linkers - Optimization flags ---------------------------
# ----------------------------------------------------------------------
#
CC = /opt/mpich2-1.4.1p1/bin/mpicc
CCNOOPT = $(HPL_DEFS)
CCFLAGS = $(HPL_DEFS) -fomit-frame-pointer -O3 -funroll-loops
#
# On some platforms, it is necessary to use the Fortran linker to find
# the Fortran internals used in the BLAS library.
#
LINKER = /opt/mpich2-1.4.1p1/bin/mpicc
LINKFLAGS = $(CCFLAGS)
#
ARCHIVER = ar
ARFLAGS = r
RANLIB = echo
#
# ----------------------------------------------------------------------
After configuring the above Make.Linux_PII_CBLAS file, run:
$make arch=Linux_PII_CBLAS
Now run Linpack (on a single node):
$cd bin/Linux_PII_CBLAS
$mpiexec -n 4 ./xhpl
Repeat steps 1-5 on all nodes; Linpack can then be run on all nodes like this (from the directory $(HOME)/hpl-2.0/bin/Linux_PII_CBLAS/):
$mpiexec -n x ./xhpl
where x is the number of cores in the cluster.

4.8.3 HPL results

The first thing to note is that the HPL.dat file available after installation is essentially useless for extracting meaningful performance numbers, so the


file needs to be edited. The first test used the default configuration; HPL.dat was then tuned and the test run again. For a single PC, HPL gave a performance of 11.25

Figure 4.8: Graph showing High Performance Linpack (HPL) Results

GFlops. The highest value given by the cluster of four machines is 28.67 GFlops, which means there is an absolute performance gain for the cluster over a single machine. It is interesting to note that the maximum performance (28.67 GFlops) was achieved for a problem size of 30000 and a block size of 168, although, to be fair, the difference between block sizes of 168 and 128 is small. Also interesting is how much the data varies with problem size; the PCs in the cluster do not have a separate network and thus performance is unlikely to ever be constant. The efficiency of the cluster is about 13%, which is poor, but given the various limitations of the system it is perhaps not that surprising.
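For reference, the efficiency figure quoted above follows directly from the measured result and the theoretical peak computed in Section 4.6:

$\text{Efficiency} = \frac{R_{\mathrm{max}}}{R_{\mathrm{peak}}} = \frac{28.67\ \text{GFlops}}{217.6\ \text{GFlops}} \approx 0.13 = 13\%$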

4.9 Run HPCC on cluster

The HPC Challenge Benchmark set of tests is primarily High Performance Linpack along with some additional bells-and-whistles tests. The nice thing is that experience in running HPL can be directly leveraged in running HPCC, and vice versa. Instead of a binary named xhpl, with HPCC a binary named hpcc is generated after compiling the HPC Challenge Benchmark. This binary runs the whole series of tests. First download hpcc-1.4.1.tar.gz and save it into the /root directory. Following


is a set of commands to get going with HPCC:
#cd /root
#tar xzvf hpcc-1.4.1.tar.gz
#cd hpcc-1.4.1
#cd hpl
#cp setup/Make.Linux_PII_CBLAS .
#vi Make.Linux_PII_CBLAS
Apply the same changes to this file as in the section on compiling and running HPL above, except that TOPdir=../../..
Next, build the HPC Challenge Benchmark, configure hpccinf.txt (which can be derived from the previous settings for HPL.dat), and then invoke the tool. After modifying Make.Linux_PII_CBLAS:
#cd /root/hpcc-1.4.1
#make arch=Linux_PII_CBLAS
Copy the sample input file found in the root hpcc-1.4.1 directory to hpccinf.txt. Make the following changes to lines 33-36 to control the problem sizes and blocking factors for PTRANS. Change lines 33-34 (number of PTRANS problem sizes and the sizes) to:
4 Number of additional problem sizes for PTRANS
1000 2500 5000 10000 values of N
Change lines 35-36 (number of block sizes and the sizes) to:
2 Number of additional blocking sizes for PTRANS
64 128 values of NB
Now run it:
#cd /root/hpcc-1.4.1
#mpiexec -np <numprocs> ./hpcc
The results will be in hpccoutf.txt.

4.9.1 HPCC Results

Finally, a few HPCC benchmark runs were carried out. As with the Linpack benchmarks, the HPCC benchmark was also compiled with ATLAS. Generally speaking, the cluster continues to perform better than a single PC, but clearly some of the benchmarks are hardly affected at all. It is worth bearing in mind that this four-node cluster does not have its own separate network switch, and thus the results will vary more than in a cluster with dedicated networking. Table 4.1 shows some important results from the various tests of the HPCC Benchmark suite.


Test                                   One Processor   Cluster
HPL Tflops                             0.00072716      0.0283605
StarDGEMM Gflops                       4.83506         4.77583
SingleDGEMM Gflops                     4.79438         4.92708
PTRANS GBs                             0.0425573       0.0409784
MPIRandomAccess LCG GUPs               0.00707042      0.00663434
MPIRandomAccess GUPs                   0.00706074      0.00660636
StarRandomAccess LCG GUPs              0.176132        0.0170042
SingleRandomAccess LCG GUPs            0.171557        0.0344993
StarRandomAccess GUPs                  0.24612         0.0183594
SingleRandomAccess GUPs                0.241174        0.0448233
StarSTREAM Copy                        27.0668         2.92135
StarSTREAM Scale                       25.3788         2.91262
StarSTREAM Add                         27.1221         3.23188
StarSTREAM Triad                       25.4848         3.40194
SingleSTREAM Copy                      26.7578         10.9827
SingleSTREAM Scale                     24.7451         10.9912
SingleSTREAM Add                       26.3792         12.6537
SingleSTREAM Triad                     24.7451         12.7064
StarFFT Gflops                         2.14797         1.33174
SingleFFT Gflops                       2.10237         1.85049
MPIFFT N                               65536           134217728
MPIFFT Gflops                          0.0587352       0.107084
MaxPingPongLatency usec                340.059         344.502
RandomlyOrderedRingLatency usec        167.139         154.586
MinPingPongBandwidth GBytes            0.0116524       0.0116511
NaturallyOrderedRingBandwidth GBytes   0.0104104       0.00243253
RandomlyOrderedRingBandwidth GBytes    0.00981377      0.00228357
MinPingPongLatency usec                322.998         0.203097
AvgPingPongLatency usec                334.724         267.98
MaxPingPongBandwidth GBytes            0.0116628       0.0116638
AvgPingPongBandwidth GBytes            0.0116578       0.0116587
NaturallyOrderedRingLatency usec       126.505         130.391

Table 4.1: HPCC Results on Single PC and Cluster

Chapter 5

Results and Applications

5.1 Discussion on Results

Clusters effectively reduce the overall computational time, demonstrating excellent performance improvement in terms of Flops. However, performance on clusters may be limited by interconnect speed. The choice of which interconnect to use depends mostly on whether inter-server communication will be a bottleneck in the mix of jobs to be run.

5.1.1 Observations about Small Tasks

1. Jobs with very small numbers as input are bound by communication time.

2. Since the sequential runtime is so small, the time to send to and receive from the head node makes the program take longer with more nodes, so adding processors slows down the program's runtime.

3. Parallel execution of such computations is impractical.

4. Speedup is observed by using a small cluster, but it does not scale well at all.

5. For such jobs it is better to use a single processor than even a modestly sized cluster (a minimal timing sketch illustrating this follows the list).
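As an illustration of these observations, the following minimal MPI sketch (written for this discussion; it is not one of the cluster's test programs, and the file and variable names are arbitrary) has the head node hand each worker a deliberately tiny piece of work and time the run:

/* tiny_task.c -- hypothetical micro-benchmark, illustrative only.
 * Rank 0 distributes a trivially small amount of work and collects the
 * results; with so little computation, the elapsed time is dominated by
 * the message round trips rather than the arithmetic. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, p, i, lo;
    const int n = 64;                 /* deliberately tiny work per process */
    double t0, partial, total = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    t0 = MPI_Wtime();

    if (rank == 0) {
        for (p = 1; p < size; p++) {              /* hand out trivial work   */
            lo = p * n;
            MPI_Send(&lo, 1, MPI_INT, p, 0, MPI_COMM_WORLD);
        }
        for (p = 1; p < size; p++) {              /* collect partial results */
            MPI_Recv(&partial, 1, MPI_DOUBLE, p, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            total += partial;
        }
        printf("total = %f, elapsed = %f s\n", total, MPI_Wtime() - t0);
    } else {
        partial = 0.0;
        MPI_Recv(&lo, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        for (i = lo; i < lo + n; i++)             /* microseconds of work    */
            partial += 1.0 / (double)(i + 1);
        MPI_Send(&partial, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}

Compiled with mpicc and launched with, say, mpiexec -np 4 ./tiny_task, the computation itself takes only microseconds, so on a network like this one would expect almost the entire elapsed time to be the hundreds-of-microseconds message latencies reported in Table 4.1.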

5.1.2 Observations about Larger Tasks

1. Jobs with larger numbers as input are bound by sequential computation time for a small number of processors, but eventually adding processors causes communication time to take over.


2. Sequential runtime with large numbers is much larger, so it scales much better than with small numbers as input.

3. Inter-node communication has a much larger effect on runtime than intra-node communication.

4. With infinitely large numbers, communication times would be negligible.

5. Unlike jobs requiring very little sequential computation and a lot of communication, this job achieved speedup with large numbers of processors.

Due to the various overheads discussed throughout this chapter, and because certain parts of a sequential algorithm cannot be parallelized, we may not achieve optimal parallelization. In such cases there is no performance gain; indeed, in some cases performance is degraded by communication and synchronization overhead.
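This limit is usually quantified by Amdahl's law (quoted here as a standard result; the numbers below are illustrative, not measured on this cluster): if a fraction f of a program's work can be parallelized across p processors, the speedup is bounded by

\[ S(p) \le \frac{1}{(1-f) + f/p} . \]

For example, with f = 0.9 and p = 4 the bound is 1/(0.1 + 0.225), which is roughly 3.1, and even with unlimited processors the speedup can never exceed 1/(1-f) = 10.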

5.2 Factors affecting Cluster performance

As per the result analysis of the various tests and benchmarks, the following are a few of the most important factors which affect the performance of the cluster. Metrics having a significant effect on Linpack are listed below; a rule of thumb for choosing the problem size is sketched after the list.

1. Problem Size

2. Size of Blocks

3. Topology
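For the problem size in particular, a commonly quoted rule of thumb (adopted here as an assumption for illustration, not a value taken from this cluster's input files) is to choose N so that the matrix occupies roughly 80% of the cluster's total memory, leaving the remainder for the operating system and MPI:

\[ N \approx \sqrt{\frac{0.8 \, M_{total}}{8}} \]

where \(M_{total}\) is the aggregate memory in bytes and 8 is the size in bytes of one double-precision matrix element. Larger N generally improves Linpack efficiency until the nodes start swapping, at which point performance collapses.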

Tightly Coupled MPI Applications:

1. Very sensitive to network performance characteristics such as inter-node communication delay and the OS network stack.

2. Very sensitive to mismatched node performance; for example, random OS activities can add millisecond delays on top of communication delays that are normally measured in microseconds.

5.3 Benefits

1. Cost-effective: Built from relatively inexpensive commodity components that are widely available.


2. Keeps pace with technologies: Uses mass-market components, making it easy to employ the latest technologies to maintain the cluster.

3. Flexible configuration: Users can tailor a configuration that is feasible to them and allocate the budget wisely to meet the performance requirements of their applications.

4. Scalability: Can be easily scaled up by adding more compute nodes.

5. Usability: The system can be used by specified users to achieve specified goals with effectiveness, efficiency, and satisfaction in a specified context of use.

6. Manageability: A group of systems can be managed as a single system or a single database, without having to sign on to individual systems. A cluster administrative domain can even be used to more easily manage resources that are shared within the cluster.

7. Reliability: The system, including all hardware, firmware, and software, will satisfactorily perform the task for which it was designed or intended, for a specified time and in a specified environment.

8. High availability: Each compute node is an individual machine. The failure of a compute node will not affect other nodes or the availability of the entire cluster.

9. Compatibility and Portability: A parallel application written with MPI can easily be ported from expensive parallel computers to a Beowulf cluster (a minimal example is given below).
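As a small illustration of this portability (a generic example, not taken from the thesis code), the program below uses only standard MPI calls, so the same source compiles and runs unchanged under MPICH on this Beowulf cluster or under any other MPI implementation on a commercial parallel machine:

/* portable.c -- generic, illustrative MPI program (hypothetical file name). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, len;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(name, &len);

    /* Each process reports its rank and the node it is running on. */
    printf("Process %d of %d running on %s\n", rank, size, name);

    MPI_Finalize();
    return 0;
}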

5.4 Challenges of parallel computing

Parallel programming is not just a matter of choosing whether to code with threads, message passing or some other tool. In general, anybody working in the field of parallelization must consider the overall picture, which contains a plethora of issues such as:

1. Understanding the hardware: An understanding of the parallel computer architecture is necessary for efficient mapping and distribution of computational tasks. A simplified classification of parallel architectures is UMA/NUMA and distributed systems. A typical application may have to run on a combination of these architectures.


2. Mapping and distribution on to the hardware: Mapping and distribution of both computational tasks on processors and of data onto memory elements must be considered. The whole application must be divided into components and subcomponents, and then these components and subcomponents distributed on the hardware. The distribution may be static or dynamic.

3. Parallel Overhead: Parallel overhead refers to the amount of time required to coordinate parallel tasks as opposed to doing useful work. Typical parallel overhead includes the time to start/terminate a task, the time to pass messages between tasks, synchronization time, and other extra computation time. When parallelizing a serial application, overhead is inevitable. Developers have to estimate the potential cost and try to avoid unnecessary overhead caused by inefficient design or operations.

4. Synchronization: Synchronization is necessary in multi-threaded programs to prevent race conditions. Synchronization limits parallel efficiency even more than parallel overhead in that it serializes parts of the program. Improper synchronization methods may cause incorrect results from the program. Developers are responsible for pinpointing the shared resources that may cause race conditions in a multi-threaded program, and they are also responsible for adopting proper synchronization structures and methods to make sure resources are accessed in the correct order without inflicting too much of a performance penalty (a minimal illustration is given after this list).

5. Load Balance: Load balance is important in a threaded application because poor load balance causes under-utilization of processors. After one task finishes its job on a processor, the processor is idle until new tasks are assigned to it. In order to achieve the optimal performance result, developers need to find out where the imbalance of the work load lies between the different threads running on the processors and fix this imbalance by spreading out the work more evenly for each thread.

6. Granularity: For a task that can be divided and performed concurrently by several subtasks, it is usually more efficient to introduce threads to perform some subtasks. However, there is always a tipping point where performance cannot be improved by dividing a task into smaller-sized tasks (or introducing more threads). The reasons for this are 1) multi-threading causes extra overhead; 2) the degree of concurrency is limited by the number of processors; and 3) most of the time, one subtask's execution is dependent on another's completion. That is why developers have to decide to what extent they make their application parallel. The bottom line is that the amount of work per independent task should be sufficient to outweigh the cost of creating and managing threads.
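To make the synchronization challenge (item 4 above) concrete, here is a minimal pthreads sketch (illustrative only; it is not part of the cluster software, and the names are arbitrary) that protects a shared counter with a mutex so that concurrent increments do not race:

/* race_free_counter.c -- hypothetical example for illustration.
 * NTHREADS threads each increment a shared counter NITER times; the mutex
 * serializes the critical section so no updates are lost. */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define NITER    100000

static long counter = 0;                          /* shared resource */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    int i;
    (void)arg;
    for (i = 0; i < NITER; i++) {
        pthread_mutex_lock(&lock);                /* enter critical section */
        counter++;
        pthread_mutex_unlock(&lock);              /* leave critical section */
    }
    return NULL;
}

int main(void)
{
    pthread_t threads[NTHREADS];
    int i;

    for (i = 0; i < NTHREADS; i++)
        pthread_create(&threads[i], NULL, worker, NULL);
    for (i = 0; i < NTHREADS; i++)
        pthread_join(threads[i], NULL);

    /* Without the mutex, lost updates (a race condition) would usually
       leave counter below the expected value. */
    printf("counter = %ld (expected %d)\n", counter, NTHREADS * NITER);
    return 0;
}

Compile with the -pthread flag. Note that the mutex serializes the increments, which is exactly the efficiency cost described above, so critical sections should be kept as short as possible.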

5.5 Common applications of high-performance computing clusters

Almost everyone needs fast processing power. With the increasing availability of cheaper and faster computers, more people are interested in reaping the technological benefits. There is no upper bound on the demand for processing power; even with the rapid increase in available power, demand remains considerably greater than what is available.

1. Scheduling: Manufacturing; transportation (from dairy delivery to military deployment); university classes; airline scheduling.

2. Network Simulations: Simulations for power utilities and telecommunications providers.

3. Computational ElectroMagnetics: Antenna design; Stealth vehicles; Noise in high frequency circuits; Mobile phones.

Figure 5.1: Application Perspective of Grand Challenges


4. Environmental Modelling-Earth/Ocean/Atmospheric Simulation: Weather forecasting, climate simulation, oil reservoir simulation, waste repository simulation

5. Simulation on Demand: Education, tourism, city planning, defense mission planning, generalized flight simulator.

6. Graphics Rendering: Hollywood movies, Virtual reality.

7. Complex Systems Modelling and Integration: Defense (SIMNET, flight simulators), Education (SIMCITY), Multimedia/VR in entertainment, Multiuser virtual worlds, Chemical and nuclear plant operation.

8. Financial and Economic Modelling: Real time optimisation, Mortgage backed securities, Option pricing.

9. Image Processing: Medical instruments, EOS Mission to Planet Earth, De- fense Surveillance, Computer Vision.

10. Healthcare and Insurance Fraud Detection: Inefficiency, Securities fraud, Credit card fraud.

11. Market Segmentation Analysis: Marketing and sales planning. Sort and classify records to determine customer preference by region (city and house).

Chapter 6

Conclusion and Future Work

6.1 Conclusion

The implemented HPCC system allows any research center to install and use a low-cost parallel programming environment, which can be administered on an easy-to-use basis even by staff unfamiliar with clusters. Such clusters allow evaluating the efficiency of any parallel code written to solve the computational problems faced by the scientific community. This type of parallel programming environment is expected to be the subject of great development effort within the coming years, since an increasing number of universities and research centers around the world include Beowulf clusters in their hardware. The main disadvantage of this type of environment could be the latency of the interconnections between the machines. This HPCC can be used for research on object-oriented parallel languages, recursive matrix algorithms, network protocol optimization, graphical rendering, etc. It can also be used to create the college's own cloud and to deploy cloud applications on it, which can be accessed from anywhere in the outside world with just a web browser. Computer Science and Information Technology students will receive extensive experience using such a cluster, and it is expected that several students and faculty will use it for their project and research work.

6.2 Future Work

As computer networks become cheaper and faster, a new computing paradigm, called the Grid, has evolved. The Grid is a large system of computing resources that performs tasks and provides users a single point of access to these distributed resources, commonly based on a World Wide Web interface. Users can submit thousands of jobs at a time without being concerned about where they run. The Grid may scale from single systems to supercomputer-class compute farms that utilise thousands of processors. By providing scalable, secure, high-performance mechanisms for discovering and negotiating access to remote resources, the Grid promises to make it possible for colleges and universities in collaboration to share resources on an unprecedented scale, and for geographically distributed groups to work together in ways that were previously impossible.

Additionally, the HPCC can be used to create cloud applications and give students hands-on experience of this booming technology. Cloud computing also works to the students' advantage when it comes to managing environments. Before virtualization, it would have been impossible for an individual student to practice managing their own multiple-server environment; even just three servers would have cost thousands of dollars in years past. Now, with virtualization, it takes just a few minutes to spin up three new VMs. If a college were to leverage virtualization in its classrooms, students could manage their own multi-server environment in the cloud with ease. A student could control everything from the creation of the VMs to their retirement, gaining valuable experience in one of the hottest fields in IT.


Appendix A

PuTTY

PuTTY is a free and open source terminal emulator application which can act as a client for the SSH, Telnet, rlogin, and raw TCP computing protocols and as a serial console client. The name "PuTTY" has no definitive meaning, though "tty" is the name for a terminal in the Unix tradition, usually held to be short for Teletype. PuTTY was originally written for Microsoft Windows, but it has been ported to various other operating systems. Official ports are available for some Unix-like platforms, with work-in-progress ports to Classic Mac OS and Mac OS X, and unofficial ports have been contributed to platforms such as Symbian and Windows Mobile.

A.1 How to use PuTTY to connect to a remote computer

1. First download and install PuTTY. Open PuTTY by double-clicking the PuTTY icon.

2. In the host name box, enter the name or IP address of the server on which the account is hosted (for example: 115.119.224.72). Under protocol, choose SSH and then press Open.

3. A security alert dialogue box like the one shown in Figure A.2 may then appear; don't be alarmed, simply press Yes when prompted.

4. It will then prompt for the login name (username) and the password. Enter the username, hit Enter, and then type the password (the password


Figure A.1: Putty GUI

Figure A.2: Putty Security Alert

won't be visible; this is how Linux and Unix servers work). Then hit Enter. Also, please remember that passwords are case sensitive.

A.2 PSCP

PSCP, the PuTTY Secure Copy client, is a tool for transferring files securely between computers using an SSH connection. If the server supports SSH-2, PSFTP is generally preferable for interactive use; PSFTP does not, however, work with SSH-1 servers in general.


Figure A.3: Putty Remote Login Screen

A.2.1 Starting PSCP

PSCP is a command-line application. This means that you cannot simply double-click its icon to run it; instead you have to bring up a console window. On Windows 95, 98, and ME, this is called an MS-DOS Prompt; on Windows XP, Vista and Windows 7 it is called a Command Prompt. It should be available from the Programs section of the Start Menu. To start PSCP it will need either to be on the PATH or in the current directory. To add the directory containing PSCP to the PATH environment variable, type into the console window:

set PATH=%PATH%;C:\Program Files (x86)\PuTTY

This will only work for the lifetime of that particular console window. To set PATH more permanently, use the Environment Variables dialog of the System Control Panel on Windows NT, 2000, XP, Vista and 7 (on Windows 95, 98 and ME, add a set command like the one above to AUTOEXEC.BAT).

A.2.2 PSCP Usage

To copy the local file c:\documents\foo.txt from Windows to the folder /tmp on the Linux server example.com as user beowulf, type:

C:\Users\FOSS>pscp c:\documents\foo.txt [email protected]:/tmp

To copy the file /root/hosts from the Linux machine to e:\tmp on Windows, type:

C:\Users\FOSS>pscp [email protected]:/root/hosts e:\tmp
