Scyld ClusterWare HPC

User’s Guide

Revised Edition
Published April 4, 2010
Copyright © 1999 - 2010 Penguin Computing, Inc.

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means (electronic, mechanical, photocopying, recording or otherwise) without the prior written permission of the publisher. Scyld ClusterWare, the Highly Scyld logo, and the Penguin Computing logo are trademarks of Penguin Computing, Inc. All other trademarks and copyrights referred to are the property of their respective owners.

Table of Contents

Preface
    Feedback
1. Scyld ClusterWare Overview
    What Is a Beowulf Cluster?
    A Brief History of the Beowulf
        First-Generation Beowulf Clusters
        Scyld ClusterWare: A New Generation of Beowulf
    Scyld ClusterWare Technical Summary
        Top-Level Features of Scyld ClusterWare
        Process Space Migration Technology
        Compute Node Provisioning
        Compute Node Categories
        Compute Node States
        Major Software Components
    Typical Applications of Scyld ClusterWare
2. Interacting With the System
    Verifying the Availability of Nodes
    Monitoring Node Status
        The BeoStatus GUI Tool
            BeoStatus Node Information
            BeoStatus Update Intervals
            BeoStatus in Text Mode
        The bpstat Command Line Tool
        The beostat Command Line Tool
    Issuing Commands
        Commands on the Master Node
        Commands on the Compute Node
        Examples for Using bpsh
        Formatting bpsh Output
        bpsh and Shell Interaction
    Copying Data to the Compute Nodes
        Sharing Data via NFS
        Copying Data via bpcp
        Programmatic Data Transfer
        Data Transfer by Migration
    Monitoring and Controlling Processes
3. Running Programs
    Program Execution Concepts
        Stand-Alone Computer vs. Scyld Cluster
        Traditional Beowulf Cluster vs. Scyld Cluster
        Program Execution Examples
    Environment Modules
    Running Programs That Are Not Parallelized
        Starting and Migrating Programs to Compute Nodes (bpsh)
        Copying Information to Compute Nodes (bpcp)
    Running Parallel Programs
        An Introduction to Parallel Programming APIs

            MPI
            PVM
            Custom APIs
        Mapping Jobs to Compute Nodes
        Running MPICH Programs
            mpirun
            Setting Mapping Parameters from Within a Program
            Examples
        Running OpenMPI Programs
            Pre-Requisites to Running OpenMPI
            Using OpenMPI
        Running PVM-Aware Programs
        Porting Other Parallelized Programs
    Running Serial Programs in Parallel
        mpprun
            Options
            Examples
        beorun
            Options
            Examples
    Job Batching
        Job Batching Options for ClusterWare
        Job Batching with TORQUE
        Running a Job
        Checking Job Status
        Finding Out Which Nodes Are Running a Job
        Finding Job Output
    File Systems
    Sample Programs Included with Scyld ClusterWare
        linpack
        mpi-mandel
A. Glossary of Parallel Computing Terms

Preface

Welcome to the Scyld ClusterWare HPC User’s Guide. This manual is for those who will use ClusterWare to run applications, so it presents the basics of ClusterWare parallel computing — what ClusterWare is, what you can do with it, and how you can use it. The manual covers the ClusterWare architecture and discusses the unique features of Scyld ClusterWare HPC. It will show you how to navigate the ClusterWare environment, how to run programs, and how to monitor their performance. Because this manual is for the user accessing a ClusterWare system that has already been configured, it does not cover how to install, configure, or administer your Scyld cluster. You should refer to other parts of the Scyld documentation set for additional information, specifically:

• If you have not yet built your cluster or installed Scyld ClusterWare HPC, refer to the Installation Guide.

• If you are looking for information on how to administer your cluster, refer to the Administrator’s Guide.

• If you plan to write programs to use on your Scyld cluster, refer to the Programmer’s Guide.

Also not covered is use of the Linux operating system, on which Scyld ClusterWare is based. Some of the basics are presented here, but if you have not used Linux or Unix before, a book or online resource will be helpful. Books by O’Reilly and Associates1 are good sources of information. This manual will provide you with information about the basic functionality of the utilities needed to start being productive with Scyld ClusterWare.

Feedback

We welcome any reports on errors or difficulties that you may find. We also would like your suggestions on improving this document. Please direct all comments and problems to [email protected]. When writing your email, please be as specific as possible, especially with errors in the text. Please include the chapter and section information. Also, please mention in which version of the manual you found the error. This version is Scyld ClusterWare HPC, Revised Edition, published April 4, 2010.

Notes

1. http://www.oreilly.com


Chapter 1. Scyld ClusterWare Overview

Scyld ClusterWare is a Linux-based high-performance computing system. It solves many of the problems long associated with Linux Beowulf-class cluster computing, while simultaneously reducing the costs of system installation, administration, and maintenance. With Scyld ClusterWare, the cluster is presented to the user as a single, large-scale parallel computer. This chapter presents a high-level overview of Scyld ClusterWare. It begins with a brief history of Beowulf clusters, and discusses the differences between the first-generation Beowulf clusters and a Scyld cluster. A high-level technical summary of Scyld ClusterWare is then presented, covering the top-level features and major software components of Scyld. Finally, typical applications of Scyld ClusterWare are discussed. Additional details are provided throughout the Scyld ClusterWare HPC documentation set.

What Is a Beowulf Cluster?

The term "Beowulf" refers to a multi-computer architecture designed for executing parallel computations. A "Beowulf cluster" is a parallel computer system conforming to the Beowulf architecture, which consists of a collection of commodity off-the-shelf (COTS) computers (referred to as "nodes"), connected via a private network running an open-source operating system. Each node, typically running Linux, has its own processor(s), memory, storage, and I/O interfaces. The nodes communicate with each other through a private network, such as Ethernet or Infiniband, using standard network adapters. The nodes usually do not contain any custom hardware components, and are trivially reproducible.

One of these nodes, designated as the "master node", is usually attached to both the private and public networks, and is the cluster’s administration console. The remaining nodes are commonly referred to as "compute nodes". The master node is responsible for controlling the entire cluster and for serving parallel jobs and their required files to the compute nodes. In most cases, the compute nodes are configured and controlled by the master node. Typically, the compute nodes require neither keyboards nor monitors; they are accessed solely through the master node. From the viewpoint of the master node, the compute nodes are simply additional processor and memory resources.

In conclusion, Beowulf is a technology for networking Linux computers together to create a parallel, virtual supercomputer. The collection as a whole is known as a "Beowulf cluster". While early Linux-based Beowulf clusters provided a cost-effective hardware alternative to the supercomputers of the day, allowing users to execute high-performance computing applications, the original software implementations were not without their problems. Scyld ClusterWare addresses — and solves — many of these problems.

A Brief History of the Beowulf

Cluster computer architectures have a long history. The early network-of-workstations (NOW) architecture used a group of standalone processors connected through a typical office network, their idle cycles harnessed by a small piece of special software, as shown below.


Figure 1-1. Network-of-Workstations Architecture

The NOW concept evolved to the Pile-of-PCs architecture, with one master PC connected to the public network and the remaining PCs in the cluster connected to each other, and the master, through a private network as shown in the following figure. Over time, this concept solidified into the Beowulf architecture.

Figure 1-2. A Basic Beowulf Cluster

For a cluster to be properly termed a "Beowulf", it must adhere to the "Beowulf philosophy", which requires:

• Scalable performance

• The use of commodity off-the-shelf (COTS) hardware

• The use of an open-source operating system, typically Linux

Use of commodity hardware allows Beowulf clusters to take advantage of the economies of scale in the larger computing markets. In this way, Beowulf clusters can always take advantage of the fastest processors developed for high-end workstations, the fastest networks developed for backbone network providers, and so on. The progress of Beowulf clustering technology is not governed by any one company’s development decisions, resources, or schedule.


First-Generation Beowulf Clusters

The original Beowulf software environments were implemented as downloadable add-ons to commercially-available Linux distributions. These distributions included all of the software needed for a networked workstation: the kernel, various utilities, and many add-on packages. The downloadable Beowulf add-ons included several programming environments and development libraries as individually-installable packages.

With this first-generation Beowulf scheme, every node in the cluster required a full Linux installation and was responsible for running its own copy of the kernel. This requirement created many administrative headaches for the maintainers of Beowulf-class clusters. For this reason, early Beowulf systems tended to be deployed by the software application developers themselves (and required detailed knowledge to install and use). Scyld ClusterWare reduces and/or eliminates these and other problems associated with the original Beowulf-class clusters.

Scyld ClusterWare: A New Generation of Beowulf

Scyld ClusterWare streamlines the process of configuring, administering, running, and maintaining a Beowulf-class cluster computer. It was developed with the goal of providing the software infrastructure for commercial production cluster solutions.

Scyld ClusterWare was designed with the differences between master and compute nodes in mind; it runs only the appropriate software components on each compute node. Instead of having a collection of computers each running its own fully-installed operating system, Scyld creates one large distributed computer. The user of a Scyld cluster will never log into one of the compute nodes nor worry about which compute node is which. To the user, the master node is the computer, and the compute nodes appear merely as attached processors capable of providing computing resources.

With Scyld ClusterWare, the cluster appears to the user as a single computer. Specifically,

• The compute nodes appear as attached processor and memory resources

• All jobs start on the master node, and are migrated to the compute nodes at runtime

• All compute nodes are managed and administered collectively via the master node

The Scyld ClusterWare architecture simplifies cluster setup and node integration, requires minimal system administration, provides tools for easy administration where necessary, and increases cluster reliability through seamless scalability. In addition to its technical advances, Scyld ClusterWare provides a standard, stable, commercially-supported platform for deploying advanced clustering systems. See the next section for a technical summary of Scyld ClusterWare.

Scyld ClusterWare Technical Summary

Scyld ClusterWare presents a more uniform system view of the entire cluster to both users and applications through extensions to the kernel. A guiding principle of these extensions is to have little increase in both kernel size and complexity and, more importantly, negligible impact on individual processor performance.

In addition to its enhanced Linux kernel, Scyld ClusterWare includes libraries and utilities specifically improved for high-performance computing applications. For information on the Scyld libraries, see the Reference Guide. Information on using the Scyld utilities to run and monitor jobs is provided in Chapter 2 and Chapter 3. If you need to use the Scyld utilities to configure and administer your cluster, see the Administrator’s Guide.

Top-Level Features of Scyld ClusterWare

The following list summarizes the top-level features of Scyld ClusterWare.


Security and Authentication

With Scyld ClusterWare, the master node is a single point of security administration and authentication. The authentication envelope is drawn around the entire cluster and its private network. This obviates the need to manage copies or caches of credentials on compute nodes or to add the overhead of networked authentication. Scyld ClusterWare provides simple permissions on compute nodes, similar to Unix file permissions, allowing their use to be administered without additional overhead.

Easy Installation

Scyld ClusterWare is designed to augment a full Linux distribution, such as Red Hat Enterprise Linux (RHEL) or CentOS. The installer used to initiate the installation on the master node is provided on an auto-run CD-ROM. You can install from scratch and have a running Linux HPC cluster in less than an hour. See the Installation Guide for full details.

Install Once, Execute Everywhere

A full installation of Scyld ClusterWare is required only on the master node. Compute nodes are provisioned from the master node during their boot process, and they dynamically cache any additional parts of the system during process migration or at first reference.

Single System Image

Scyld ClusterWare makes a cluster appear as a multi-processor parallel computer. The master node maintains (and presents to the user) a single process space for the entire cluster, known as the BProc Distributed Process Space. BProc is described briefly later in this chapter, and more details are provided in the Administrator’s Guide.

Execution Time Process Migration

Scyld ClusterWare stores applications on the master node. At execution time, BProc migrates processes from the master to the compute nodes. This approach virtually eliminates both the risk of version skew and the need for hard disks on the compute nodes. More information is provided in the section on process space migration later in this chapter. Also refer to the BProc discussion in the Administrator’s Guide.

Seamless Cluster Scalability

Scyld ClusterWare seamlessly supports the dynamic addition and deletion of compute nodes without modification to existing source code or configuration files. See the chapter on the BeoSetup utility in the Administrator’s Guide.

Administration Tools

Scyld ClusterWare includes simplified tools for performing cluster administration and maintenance. Both graphical user interface (GUI) and command line interface (CLI) tools are supplied. See the Administrator’s Guide for more information.

Web-Based Administration Tools

Scyld ClusterWare includes web-based tools for remote administration, job execution, and monitoring of the cluster. See the Administrator’s Guide for more information.

Additional Features

Additional features of Scyld ClusterWare include support for cluster power management (IPMI and Wake-on-LAN, easily extensible to other out-of-band management protocols); runtime and development support for MPI and PVM; and support for the LFS and NFS3 file systems.


Fully Supported

Scyld ClusterWare is fully supported by Penguin Computing, Inc.

Process Space Migration Technology

Scyld ClusterWare is able to provide a single system image through its use of the BProc Distributed Process Space, the Beowulf process space management kernel enhancement. BProc enables the processes running on compute nodes to be visible and managed on the master node. All processes appear in the master node’s process table, from which they are migrated to the appropriate compute node by BProc. Both process parent-child relationships and Unix job-control information are maintained with the migrated jobs. The stdout and stderr streams are redirected to the user’s ssh or terminal session on the master node across the network.

The BProc mechanism is one of the primary features that makes Scyld ClusterWare different from traditional Beowulf clusters. For more information, see the system design description in the Administrator’s Guide.

Compute Node Provisioning

Scyld ClusterWare utilizes light-weight provisioning of compute nodes from the master node’s kernel and Linux distribution. For Scyld Series 30 and Scyld ClusterWare HPC, PXE is the supported method for booting nodes into the cluster; the 2-phase boot sequence of earlier Scyld distributions is no longer used.

The master node is the DHCP server serving the cluster private network. PXE booting across the private network ensures that the compute node boot package is version-synchronized for all nodes within the cluster. This boot package consists of the kernel, initrd, and rootfs. If desired, the boot package can be customized per node in the Beowulf configuration file /etc/beowulf/config, which also includes the kernel command line parameters for the boot package.

For a detailed description of the compute node boot procedure, see the system design description in the Administrator’s Guide. Also refer to the chapter on compute node boot options in that document.
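For orientation, the cluster-wide settings mentioned above live in /etc/beowulf/config on the master node. The fragment below is only an illustrative sketch built from keywords this guide mentions elsewhere ("nodes", "iprange", and the per-node "node" tag); the exact syntax and values on your cluster will differ, so treat the Administrator’s Guide as the authority:

# /etc/beowulf/config -- illustrative sketch only; syntax and values are assumptions
nodes 32                         # upper bound on compute node entries ("nodes" keyword)
iprange 10.54.0.100 10.54.0.131  # private-network address range (assumed argument form)
node 00:30:48:2A:0B:3C           # one "node" entry (MAC address) per configured compute node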

Compute Node Categories

Compute nodes seen by the master over the private network are classified into one of three categories by the master node, as follows:

• Unknown — A node not formally recognized by the cluster as being either a "configured" or "ignored" node. When bringing a new compute node online, or after replacing an existing node’s network interface card, the node will be classified as "unknown".

• Ignored — Nodes which, for one reason or another, you’d like the master node to ignore. These are not considered part of the cluster, nor will they receive a response from the master node during their boot process.

• Configured — Those nodes listed in the cluster configuration file using the "node" tag. These are formally part of the cluster, recognized as such by the master node, and used as computational resources by the cluster.

For more information on compute node categories, see the system design description in the Administrator’s Guide.


Compute Node States

BProc maintains the current condition or "node state" of each configured compute node in the cluster. The compute node states are defined as follows:

• down — Node is not communicating with the master and its previous state was either "down", "up", "error", "unavailable", or "boot".

• unavailable — Node has been marked "unavailable" or "off-line" by the cluster administrator; typically used when performing maintenance activities.

• error — Node encountered an error during its initialization; this state may also be set manually by the cluster administrator.

• up — Node completed its initialization without error; node is online and operating normally. This is the only state in which end users may use the node.

• reboot — Node has been commanded to reboot itself; node will remain in this state until it reaches the "boot" state, as described below.

• halt — Node has been commanded to halt itself; node will remain in this state until it is reset (or powered back on) and reaches the "boot" state, as described below.

• pwroff — Node has been commanded to power itself off; node will remain in this state until it is powered back on and reaches the "boot" state, as described below.

• boot — Node has completed its stage 2 boot but is still initializing; after the node finishes booting, its next state will be either "up" or "error".

For more information on compute node states, see the system design description in the Administrator’s Guide.

Major Software Components

The following is a list of the major software components included with Scyld ClusterWare HPC. For more information, see the relevant sections of the Scyld ClusterWare HPC documentation set, including the Installation Guide, Administrator’s Guide, User’s Guide, Reference Guide, and Programmer’s Guide.

• BProc — The process migration technology; an integral part of Scyld ClusterWare.

• BeoSetup — A GUI for configuring the cluster.

• BeoStatus — A GUI for monitoring cluster status.

• beostat — A text-based tool for monitoring cluster status.

• beoboot — A set of utilities for booting the compute nodes.

• beofdisk — A utility for remote partitioning of hard disks on the compute nodes.

• beoserv — The cluster’s DHCP, PXE and dynamic provisioning server; it responds to compute nodes and serves the boot image.

• BPmaster — The BProc master daemon; it runs on the master node.

• BPslave — The BProc compute daemon; it runs on each of the compute nodes.

• bpstat — A BProc utility that reports status information for all nodes in the cluster.

• bpctl — A BProc command line interface for controlling the nodes.

• bpsh — A BProc utility intended as a replacement for rsh (remote shell).


• bpcp — A BProc utility for copying files between nodes, similar to rcp (remote copy).

• MPI — The Message Passing Interface, optimized for use with Scyld ClusterWare.

• PVM — The Parallel Virtual Machine, optimized for use with Scyld ClusterWare.

• mpprun — A parallel job-creation package for Scyld ClusterWare.

Typical Applications of Scyld ClusterWare

Scyld clustering provides a straightforward solution for anyone executing jobs that involve either a large number of computations or large amounts of data (or both). It is ideal for both large, monolithic, parallel jobs and for many normal-sized jobs run many times (such as Monte Carlo type analysis). The increased computational resource needs of modern applications are frequently being met by Scyld clusters in a number of domains, including:

• Computationally-Intensive Activities — Optimization problems, stock trend analysis, financial analysis, complex pattern matching, medical research, genetics research, image rendering

• Scientific Computing / Research — Engineering simulations, 3D-modeling, finite element analysis, computational fluid dynamics, computational drug development, seismic data analysis, PCB / ASIC routing

• Large-Scale Data Processing — Data mining, complex data searches and results generation, manipulating large amounts of data, data archival and sorting

• Web / Internet Uses — Web farms, application serving, transaction serving, data serving

These types of jobs can be performed many times faster on a Scyld cluster than on a single computer. Increased speed depends on the application code, the number of nodes in the cluster, and the type of equipment used in the cluster. All of these can be easily tailored and optimized to suit the needs of your applications.


Chapter 2. Interacting With the System

This chapter discusses how to verify the availability of the nodes in your cluster, how to monitor node status, how to issue commands and copy data to the compute nodes, and how to monitor and control processes. For information on running programs across the cluster, see Chapter 3.

Verifying the Availability of Nodes

In order to use a Scyld cluster for computation, at least one node must be available or "up". Thus, the first priority when interacting with a cluster is ascertaining the availability of nodes. Unlike traditional Beowulf clusters, Scyld ClusterWare provides rich reporting about the availability of the nodes.

You can use either the BeoStatus GUI tool or the bpstat command to determine the availability of nodes in your cluster. These tools, which can also be used to monitor node status, are described in the next section. If fewer nodes are "up" than you think should be, or some nodes report an error, check with your Cluster Administrator.

Monitoring Node Status

You can monitor the status of nodes in your cluster with the BeoStatus GUI tool or with either of two command line tools, bpstat and beostat. These tools are described in the sections that follow. Also see the Reference Guide for information on the various options and flags supported for these tools.

The BeoStatus GUI Tool

The BeoStatus graphical user interface (GUI) tool is the best way to check the status of the cluster, including which nodes are available or "up". There are two ways to open the BeoStatus GUI as a Gnome X window. The first is to click the BeoStatus icon in the tool tray or in the applications pulldown.

Alternatively, type the command beostatus in a terminal window on the master node; you do not need to be a privileged user to use this command. The default BeoStatus GUI mode is a tabular format known as the "Classic" display (shown in the following figure). You can select different display options from the Mode menu.


Figure 2-1. BeoStatus in the "Classic" Display Mode

BeoStatus Node Information

Each row in the BeoStatus display reports information for a single node, including the following:

• Node — The node’s assigned node number, starting at zero. Node -1, if shown, is the master node. The total number of node entries shown is set by the "iprange" or "nodes" keywords in the file /etc/beowulf/config, rather than the number of detected nodes. The entry for an inactive node displays the last reported data in a grayed-out row.

• Up — A graphical representation of the node’s status. A green checkmark is shown if the node is up and available. Otherwise, a red "X" is shown.

• State — The node’s last known state. This should agree with the state reported by both the bpstat command and in the BeoSetup window.

• CPU "X" — The CPU loads for the node’s processors; at minimum, this indicates the CPU load for the first processor in each node. Since it is possible to mix uni-processor and multi-processor machines in a Scyld cluster, the number of CPU load columns is equal to the maximum number of processors for any node in your cluster. The label "N/A" will be shown for nodes with less than the maximum number of processors.

• Memory — The node’s current memory usage.

• Swap — The node’s current swap space (virtual memory) usage.

• Disk — The node’s hard disk usage. If a RAM disk is used, the maximum value shown is one-half the amount of physical memory. As the RAM disk competes with the kernel and application processes for memory, not all the RAM may be available.

• Network — The node’s network bandwidth usage. The total amount of bandwidth available is the sum of all network interfaces for that node.

BeoStatus Update Intervals

Once running, BeoStatus is non-interactive; the user simply monitors the reported information. The display is updated at 4-second intervals by default. You can modify this default using the command beostatus -u secs (where secs is the number of seconds) in a terminal window or an ssh session to the master node with X-forwarding enabled.
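For example, to poll every 10 seconds instead of the default 4 (any interval suited to your site can be substituted):

[user@cluster user] $ beostatus -u 10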


Tip: Each update places load on the master and compute nodes, as well as the interconnection network. Too-frequent updates can degrade the overall system performance.

BeoStatus in Text Mode

In environments where use of the Gnome X window system is undesirable or impractical, such as when accessing the master node through a slow remote network connection, you can view the status of the cluster as curses text output (shown in the following figure). To do this, enter the command beostatus -c in a terminal window on the master node or an ssh session to the master node.

BeoStatus in text mode reports the same node information as reported by the "Classic" display, except for the graphical indicator of node "up" (green checkmark) or node "down" (red X). The data in the text display is updated at 4-second intervals by default.

Figure 2-2. BeoStatus in Text Mode
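For example, from a remote machine you might open an ssh session to the master node and run BeoStatus in text mode (the hostnames shown here are placeholders):

[user@laptop ~] $ ssh user@cluster
[user@cluster user] $ beostatus -c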

The bpstat Command Line Tool

You can also check node status with the bpstat command. When run at a shell prompt on the master node without options, bpstat prints out a listing of all nodes in the cluster and their current status. You do not need to be a privileged user to use this command. Following is an example of the output from bpstat for a cluster with 10 compute nodes.

[user@cluster user] $ bpstat
Node(s)  Status   Mode         User    Group
5-9      down     ----------   root    root
4        up       ---x--x--x   any     any
0-3      up       ---x--x--x   root    root

bpstat will show one of the following indicators in the "Status" column:

• A node marked "up" is available to run jobs. This status is the equivalent of the green checkmark in the BeoStatus GUI.


• Nodes that have not yet been configured are marked as "down". This status is the equivalent of the red X in the BeoStatus GUI.

• Nodes currently booting are temporarily shown with a status of "boot". Wait 10-15 seconds and try again.

• The "error" status indicates a node initialization problem. Check with your Cluster Administrator.

For additional information on bpstat, see the section on monitoring and controlling processes later in this chapter. Also see the Reference Guide for details on using bpstat and its command line options.

The beostat Command Line Tool

You can use the beostat command to display raw status data for cluster nodes. When run at a shell prompt on the master node without options, beostat prints out a listing of stats for all nodes in the cluster, including the master node. You do not need to be a privileged user to use this command. The following example shows the beostat output for the master node and one compute node:

[user@cluster user] $ beostat
model           : 5
model name      : AMD Opteron(tm) Processor 248
stepping        : 10
cpu MHz         : 2211.352
cache size      : 1024 KB
fdiv_bug        : no
hlt_bug         : no
sep_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
bogomips        : 4422.05

*** /proc/meminfo *** Sun Sep 17 10:46:33 2006
        total:      used:       free:  shared:  buffers:  cached:
Mem:  4217454592  318734336  3898720256      0  60628992        0
Swap: 2089209856          0  2089209856
MemTotal:      4118608 kB
MemFree:       3807344 kB
MemShared:           0 kB
Buffers:         59208 kB
Cached:              0 kB
SwapTotal:     2040244 kB
SwapFree:      2040244 kB

*** /proc/loadavg *** Sun Sep 17 10:46:33 2006
3.00 2.28 1.09 178/178 0

*** /proc/net/dev *** Sun Sep 17 10:46:33 2006
Inter-|   Receive                                                    |  Transmit
 face |bytes      packets  errs drop fifo frame compressed multicast|bytes      packets  errs drop fifo colls carrier compressed
  eth0:85209660     615362    0    0    0     0          0         0  703311290   559376    0    0    0     0       0          0
  eth1:4576500575 13507271    0    0    0     0          0         0 9430333982 13220730    0    0    0     0       0          0


  sit0:         0        0    0    0    0     0          0         0          0        0    0    0    0     0       0          0

*** /proc/stat ***
cpu0 15040 0 466102 25629625    Sun Sep 17 10:46:33 2006
cpu1 17404 0 1328475 24751544   Sun Sep 17 10:46:33 2006

*** statfs ("/") *** Sun Sep 17 10:46:33 2006
path:      /
f_type:    0xef53
f_bsize:   4096
f_blocks:  48500104
f_bfree:   41439879
f_bavail:  38976212
f_files:   24641536
f_ffree:   24191647
f_fsid:    000000 000000
f_namelen: 255

====== Node: .0 (index 0) ======

*** /proc/cpuinfo *** Sun Sep 17 10:46:34 2006
num processors  : 2
vendor_id       : AuthenticAMD
cpu family      : 15
model           : 5
model name      : AMD Opteron(tm) Processor 248
stepping        : 10
cpu MHz         : 2211.386
cache size      : 1024 KB
fdiv_bug        : no
hlt_bug         : no
sep_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
bogomips        : 4422.04

*** /proc/meminfo *** Sun Sep 17 10:46:34 2006
        total:     used:       free:  shared:  buffers:  cached:
Mem:  4216762368  99139584  4117622784      0         0        0
Swap:          0         0           0
MemTotal:      4117932 kB
MemFree:       4021116 kB
MemShared:           0 kB
Buffers:             0 kB
Cached:              0 kB
SwapTotal:           0 kB
SwapFree:            0 kB

*** /proc/loadavg *** Sun Sep 17 10:46:34 2006
0.99 0.75 0.54 36/36 0


*** /proc/net/dev *** Sun Sep 17 10:46:34 2006
Inter-|   Receive                                                    |  Transmit
 face |bytes      packets  errs drop fifo frame compressed multicast|bytes      packets  errs drop fifo colls carrier compressed
  eth0:312353878    430256    0    0    0     0          0         0  246128779   541105    0    0    0     0       0          0
  eth1:        0         0    0    0    0     0          0         0          0        0    0    0    0     0       0          0

*** /proc/stat ***
cpu0 29984 0 1629 15340009      Sun Sep 17 10:46:34 2006
cpu1 189495 0 11131 15170565    Sun Sep 17 10:46:34 2006

*** statfs ("/") *** Sun Sep 17 10:46:34 2006
path:      /
f_type:    0x1021994
f_bsize:   4096
f_blocks:  514741
f_bfree:   492803
f_bavail:  492803
f_files:   514741
f_ffree:   514588
f_fsid:    000000 000000
f_namelen: 255

The Reference Guide provides details for using beostat and its command line options.

Issuing Commands

Commands on the Master Node

When you log into the cluster, you are actually logging into the master node, and the commands you enter on the command line will execute on the master node. The only exception is when you use special commands for interacting with the compute nodes, as described in the next section.

Commands on the Compute Node

Scyld ClusterWare provides the bpsh command for running jobs on the compute nodes. bpsh is a replacement for the traditional Unix utility rsh, used to run a job on a remote computer. Like rsh, bpsh takes as arguments the node on which to run the command and the command itself. bpsh allows you to run a command on more than one node without having to type the command once for each node, but it doesn’t provide an interactive shell on the remote node like rsh does. bpsh is primarily intended for running utilities and maintenance tasks on a single node or a range of nodes, rather than for running parallel programs. For information on running parallel programs with Scyld ClusterWare, see Chapter 3.

bpsh provides a convenient yet powerful interface for manipulating all (or a subset of) the cluster’s nodes simultaneously. bpsh provides you the flexibility to access a compute node individually, but removes the requirement to access each node individually when a collective operation is desired. A number of examples and options are discussed in the sections that follow. For a complete reference to all the options available for bpsh, see the Reference Guide.


Examples for Using bpsh

Example 2-1. Checking for a File

You can use bpsh to check for specific files on a compute node. For example, to check for a file named output in the /tmp directory of node 3, you would run the following command on the master node:

[user@cluster user] $ bpsh 3 ls /tmp/output

The command output would appear on the master node terminal where you issued the command.

Example 2-2. Running a Command on a Range of Nodes

You can run the same command on a range of nodes using bpsh. For example, to check for a file named output in the /tmp directory of nodes 3 through 5, you would run the following command on the master node:

[user@cluster user] $ bpsh 3,4,5 ls /tmp/output

Example 2-3. Running a Command on All Available Nodes

Use the -a flag to indicate to bpsh that you wish to run a command on all available nodes. For example, to check for a file named output in the /tmp directory of all nodes currently active in your cluster, you would run the following command on the master node:

[user@cluster user] $ bpsh -a ls /tmp/output

Note that when using the -a flag, the results are sorted by the response speed of the compute nodes, and are returned without node identifiers. Because this command will produce output for every currently active node, the output may be hard to read if you have a large cluster. For example, if you ran the above command on a 64-node cluster in which half of the nodes have the file being requested, the results returned would be 32 lines of /tmp/output and another 32 lines of ls: /tmp/output: no such file or directory. Without node identifiers, it is impossible to ascertain the existence of the target file on a particular node. See the next section for bpsh options that enable you to format the results for easier reading.

Formatting bpsh Output

The bpsh command has a number of options for formatting its output to make it more useful for the user, including the following:

• The -L option makes bpsh wait for a full line from a compute node before it prints out the line. Without this option, the output from your command could include half a line from node 0 with a line from node 1 tacked onto the end, then followed by the rest of the line from node 0.

• The -p option prefixes each line of output with the node number of the compute node that produced it. This option causes the functionality for -L to be used, even if not explicitly specified.


• The -s option forces the output of each compute node to be printed in sorted numerical order, rather than by the response speed of the compute nodes. With this option, all the output for node 0 will appear before any of the output for node 1. To add a divider between the output from each node, use the -d option.

• Using -d generates a divider between the output from each node. This option causes the functionality for -s to be used, even if not explicitly specified.

For example, if you run the command bpsh -a -d -p ls /tmp/output on an 8-node cluster, the output would make it clear which nodes do and do not have the file output in the /tmp directory, for example:

0 ---------------------------------------------------------------
0: /tmp/output
1 ---------------------------------------------------------------
1: ls: /tmp/output: No such file or directory
2 ---------------------------------------------------------------
2: ls: /tmp/output: No such file or directory
3 ---------------------------------------------------------------
3: /tmp/output
4 ---------------------------------------------------------------
4: /tmp/output
5 ---------------------------------------------------------------
5: /tmp/output
6 ---------------------------------------------------------------
6: ls: /tmp/output: No such file or directory
7 ---------------------------------------------------------------
7: ls: /tmp/output: No such file or directory

bpsh and Shell Interaction

Special shell features, such as piping and input/output redirection, are available to advanced users. This section provides several examples of shell interaction, using the following conventions:

• The command running will be cmda.

• If it is piped to anything, it will be piped to cmdb.

• If an input file is used, it will be /tmp/input.

• If an output file is used, it will be /tmp/output.

• The node used will always be node 0.

Example 2-4. Command on Compute Node, Output on Master Node

The easiest case is running a command on a compute node and doing something with its output on the master node, or giving it input from the master. Following are a few examples:

[user@cluster user] $ bpsh 0 cmda | cmdb
[user@cluster user] $ bpsh 0 cmda > /tmp/output
[user@cluster user] $ bpsh 0 cmda < /tmp/input


Example 2-5. Command on Compute Node, Output on Compute Node

A bit more complex situation is to run the command on the compute node and do something with its input (or output) on that same compute node. There are two ways to accomplish this.

The first solution requires that all the programs you run be on the compute node. For this to work, you must first copy the cmda and cmdb executable binaries to the compute node. Then you would use the following commands:

[user@cluster user] $ bpsh 0 sh -c "cmda | cmdb"
[user@cluster user] $ bpsh 0 sh -c "cmda > /tmp/output"
[user@cluster user] $ bpsh 0 sh -c "cmda < /tmp/input"

The second solution doesn’t require any of the programs to be on the compute node. However, it uses a lot of network bandwidth, as it takes the output and sends it to the master node, then sends it right back to the compute node. The appropriate commands are as follows:

[user@cluster user] $ bpsh 0 cmda | bpsh 0 cmdb
[user@cluster user] $ bpsh 0 cmda | bpsh 0 dd of=/tmp/output
[user@cluster user] $ bpsh 0 cat /tmp/input | bpsh 0 cmda

Example 2-6. Command on Master Node, Output on Compute Node

You can also run a command on the master node and do something with its input or output on the compute nodes. The appropriate commands are as follows:

[user@cluster user] $ cmda | bpsh 0 cmdb
[user@cluster user] $ cmda | bpsh 0 dd of=/tmp/output
[user@cluster user] $ bpsh 0 cat /tmp/input | cmda

Copying Data to the Compute Nodes

There are several ways to get data from the master node to the compute nodes. This section describes using NFS to share data, using the Scyld ClusterWare command bpcp to copy data, and using programmatic methods for data transfer.

Sharing Data via NFS

The easiest way to transfer data to the compute nodes is via NFS. All files in your /home directory are shared by default to all compute nodes via NFS. Opening an NFS-shared file on a compute node will, in fact, open the file on the master node; no actual copying takes place.
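Because /home is NFS-mounted on the compute nodes, a process running on a compute node can open files under your home directory by their usual paths. For example, the following command (the path is hypothetical) reads a file that physically resides on the master node:

[user@cluster user] $ bpsh 4 cat /home/user/params.txt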

Copying Data via bpcp

To copy a file to a compute node, rather than accessing the original across the network, you can use the bpcp command. This works much like the standard Unix file-copying command cp, in that you pass it a file to copy as one argument and the destination as the next argument. Like the Unix scp, the file paths may be qualified by a computer host name.


With bpcp, you can indicate the node number for the source file, destination file, or both. To do this, prefix the file name with the node number followed by a colon, to specify that the file is on that node or should be copied to that node. For example, to copy the file /tmp/foo to the same location on node 1, you would use the following command:

[user@cluster user] $ bpcp /tmp/foo 1:/tmp/foo
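The node prefix can be used on either argument, so the same syntax copies files in the other direction. For example, to retrieve a hypothetical result file from node 1 back to the master node’s /tmp directory:

[user@cluster user] $ bpcp 1:/tmp/results.dat /tmp/results.dat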

Programmatic Data Transfer

The third method for transferring data is to do it programmatically. This is a bit more complex than the methods described in the previous sections, and will only be described here conceptually.

If you are using an MPI job, you can have your Rank 0 process on the master node read in the data, then use MPI’s message passing capabilities to send the data over to a compute node. If you are writing a program that uses BProc functions directly, you can have the process first read the data while it is on the master node. When the process is moved over to the compute node, it should still be able to access the data read in while on the master node.

Data Transfer by Migration

Another programmatic method for file transfer is to read a file into memory prior to calling BProc to migrate the process to another node. This technique is especially useful for parameter and configuration files, or files containing the intermediate state of a computation. See the Reference Guide for a description of the BProc system calls.

Monitoring and Controlling Processes

One of the features of Scyld ClusterWare that isn’t provided in traditional Beowulf clusters is the BProc Distributed Process Space. BProc presents a single unified process space for the entire cluster, run from the master node, where you can see and control jobs running on the compute nodes. This process space allows you to use standard Unix tools, such as top, ps, and kill. See the Administrator’s Guide for more details on BProc.

Scyld ClusterWare also includes a tool called bpstat that can be used to determine which node is running a process. Using the command option bpstat -p will list all processes currently running by PID, with the number of the node running each process. The following output is an example:

[user@cluster user] $ bpstat -p
PID     Node
6301    0
6302    1
6303    0
6304    2
6305    1
6313    2
6314    3
6321    3


Using the command option bpstat -P (with an uppercase "P" instead of a lowercase "p") tells bpstat to take the output of ps and reformat it, prepending a column showing the node number. The following two examples show the difference in the outputs from ps and from bpstat -P. Example output from ps:

[user@cluster user] $ ps xf
  PID TTY      STAT   TIME COMMAND
 6503 pts/2    S      0:00 bash
 6665 pts/2    R      0:00 ps xf
 6471 pts/3    S      0:00 bash
 6538 pts/3    S      0:00 /bin/sh /usr/bin/linpack
 6553 pts/3    S      0:00  \_ /bin/sh /usr/bin/mpirun -np 5 /tmp/xhpl
 6654 pts/3    R      0:03     \_ /tmp/xhpl -p4pg /tmp/PI6553 -p4wd /tmp
 6655 pts/3    S      0:00        \_ /tmp/xhpl -p4pg /tmp/PI6553 -p4wd /tmp
 6656 pts/3    RW     0:01           \_ [xhpl]
 6658 pts/3    SW     0:00           |   \_ [xhpl]
 6657 pts/3    RW     0:01           \_ [xhpl]
 6660 pts/3    SW     0:00           |   \_ [xhpl]
 6659 pts/3    RW     0:01           \_ [xhpl]
 6662 pts/3    SW     0:00           |   \_ [xhpl]
 6661 pts/3    SW     0:00           \_ [xhpl]
 6663 pts/3    SW     0:00           \_ [xhpl]

Example of the same ps output when run through bpstat -P instead:

[user@cluster user] $ ps xf | bpstat -P
NODE    PID TTY      STAT   TIME COMMAND
       6503 pts/2    S      0:00 bash
       6666 pts/2    R      0:00 ps xf
       6667 pts/2    R      0:00 bpstat -P
       6471 pts/3    S      0:00 bash
       6538 pts/3    S      0:00 /bin/sh /usr/bin/linpack
       6553 pts/3    S      0:00  \_ /bin/sh /usr/bin/mpirun -np 5 /tmp/xhpl
       6654 pts/3    R      0:06     \_ /tmp/xhpl -p4pg /tmp/PI6553 -p4wd /tmp
       6655 pts/3    S      0:00        \_ /tmp/xhpl -p4pg /tmp/PI6553 -p4wd /tmp
0      6656 pts/3    RW     0:06           \_ [xhpl]
0      6658 pts/3    SW     0:00           |   \_ [xhpl]
1      6657 pts/3    RW     0:06           \_ [xhpl]
1      6660 pts/3    SW     0:00           |   \_ [xhpl]
2      6659 pts/3    RW     0:06           \_ [xhpl]
2      6662 pts/3    SW     0:00           |   \_ [xhpl]
3      6661 pts/3    SW     0:00           \_ [xhpl]
3      6663 pts/3    SW     0:00           \_ [xhpl]
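Because all of these processes appear in the master node’s unified process space, they can be examined and signaled directly from the master node with the standard Unix tools mentioned above; there is no need to log into a compute node. For example, using one of the PIDs shown above to terminate a remote process:

[user@cluster user] $ kill 6656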

For additional information on bpstat, see the section on monitoring node status earlier in this chapter. For information on the bpstat command line options, see the Reference Guide.


Chapter 3. Running Programs

This chapter describes how to run both serial and parallel jobs with Scyld ClusterWare, and how to monitor the status of the cluster once your applications are running. It begins with a brief discussion of program execution concepts, including some examples. The discussion then covers running programs that aren’t parallelized, running parallel programs (including MPI-aware and PVM-aware programs), running serial programs in parallel, job batching, and file systems. Finally, the chapter covers the sample linpack and mpi-mandel programs included with Scyld ClusterWare.

Program Execution Concepts

This section compares program execution on a stand-alone computer and a Scyld cluster. It also discusses the differences between running programs on a traditional Beowulf cluster and a Scyld cluster. Finally, it provides some examples of program execution on a Scyld cluster.

Stand-Alone Computer vs. Scyld Cluster

On a stand-alone computer running Linux, Unix, or most other operating systems, executing a program is a very simple process. For example, to generate a list of the files in the current working directory, you open a terminal window and type the command ls followed by the [return] key. Typing the [return] key causes the command shell — a program that listens to and interprets commands entered in the terminal window — to start the ls program (stored at /bin/ls). The output is captured and directed to the standard output stream, which also appears in the same window where you typed the command.

A Scyld cluster isn’t simply a group of networked stand-alone computers. Only the master node resembles the computing system with which you are familiar. The compute nodes have only the minimal software components necessary to support an application initiated from the master node. So for instance, running the ls command on the master node causes the same series of actions as described above for a stand-alone computer, and the output is for the master node only.

However, running ls on a compute node involves a very different series of actions. Remember that a Scyld cluster has no resident applications on the compute nodes; applications reside only on the master node. So for instance, to run the ls command on compute node 1, you would enter the command bpsh 1 ls on the master node. This command sends ls to compute node 1 via Scyld’s BProc software, and the output stream is directed to the terminal window on the master node, where you typed the command.

Some brief examples of program execution are provided in the last section of this chapter. Both BProc and bpsh are covered in more detail in the Administrator’s Guide.

Traditional Beowulf Cluster vs. Scyld Cluster

A job on a Beowulf cluster is actually a collection of processes running on the compute nodes. In traditional clusters of computers, and even on earlier Beowulf clusters, getting these processes started and running together was a complicated task. Typically, the cluster administrator would need to do all of the following:

• Ensure that the user had an account on all the target nodes, either manually or via a script.

• Ensure that the user could spawn jobs on all the target nodes. This typically entailed configuring a hosts.allow file on each machine, creating a specialized PAM module (a Linux authentication mechanism), or creating a server daemon on each node to spawn jobs on the user’s behalf.

• Copy the program binary to each node, either manually, with a script, or through a network file system.

• Ensure that each node had available identical copies of all the dependencies (such as libraries) needed to run the program.


• Provide knowledge of the state of the system to the application manually, through a configuration file, or through some add-on scheduling software.

With Scyld ClusterWare, most of these steps are removed. Jobs are started on the master node and are migrated out to the compute nodes via BProc. A cluster architecture where jobs may be initiated only from the master node via BProc provides the following advantages:

• Users no longer need accounts on remote nodes.

• Users no longer need authorization to spawn jobs on remote nodes.

• Neither binaries nor libraries need to be available on the remote nodes.

• The BProc system provides a consistent view of all jobs running on the system.

With all these complications removed, program execution on the compute nodes becomes a simple matter of letting BProc know about your job when you start it. The method for doing so depends on whether you are launching a parallel program (for example, an MPI job or PVM job) or any other kind of program. See the sections on running parallel programs and running non-parallelized programs later in this chapter.

Program Execution Examples

This section provides a few examples of program execution with Scyld ClusterWare. Additional examples are provided in the sections on running parallel programs and running non-parallelized programs later in this chapter.

Example 3-1. Directed Execution with bpsh

In the directed execution mode, the user explicitly defines which node (or nodes) will run a particular job. This mode is invoked using the bpsh command, the ClusterWare shell command analogous in functionality to both the rsh (remote shell) and ssh (secure shell) commands. Following are two examples of using bpsh.

The first example runs hostname on compute node 0 and writes the output back from the node to the user’s screen:

[user@cluster user] $ bpsh 0 /bin/hostname
n0

If /bin is in the user’s $PATH, then bpsh does not need the full pathname:

[user@cluster user] $ bpsh 0 hostname
n0

The second example runs the /usr/bin/uptime utility on node 1. Assuming /usr/bin is in the user’s $PATH:

[user@cluster user] $ bpsh 1 uptime
12:56:44 up 4:57, 5 users, load average: 0.06, 0.09, 0.03

Example 3-2. Dynamic Execution with beorun and mpprun

In the dynamic execution mode, Scyld decides which node is the most capable of executing the job at that moment in time. Scyld includes two parallel execution tools that dynamically select nodes: beorun and mpprun. They differ only in that beorun runs the job concurrently on the selected nodes, while mpprun runs the job sequentially on one node at a time.

The following example shows the difference in the elapsed time to run a command with beorun vs. mpprun:

[user@cluster user] $ date;beorun -np 8 sleep 1;date


Fri Aug 18 11:48:30 PDT 2006
Fri Aug 18 11:48:31 PDT 2006
[user@cluster user] $ date;mpprun -np 8 sleep 1;date
Fri Aug 18 11:48:46 PDT 2006
Fri Aug 18 11:48:54 PDT 2006

Example 3-3. Binary Pre-Staged on Compute Node

A needed binary can be "pre-staged" by copying it to a compute node prior to execution of a shell script. In the following example, the shell script is in a file called test.sh:

######
#! /bin/bash
hostname.local
######

[user@cluster user] $ bpsh 1 mkdir -p /usr/local/bin
[user@cluster user] $ bpcp /bin/hostname 1:/usr/local/bin/hostname.local
[user@cluster user] $ bpsh 1 ./test.sh
n1

This makes the hostname binary available on compute node 1 as /usr/local/bin/hostname.local before the script is executed. The shell’s $PATH contains /usr/local/bin, so the compute node searches locally for hostname.local in $PATH, finds it, and executes it.

Note that copying files to a compute node generally puts the files into the RAM filesystem on the node, thus reducing main memory that might otherwise be available for programs, libraries, and data on the node.

Example 3-4. Binary Migrated to Compute Node

If a binary is not "pre-staged" on a compute node, the full path to the binary must be included in the script in order to execute properly. In the following example, the master node starts the process (in this case, a shell) and moves it to node 1, then continues execution of the script. However, when it comes to the hostname.local2 command, the process fails:

######
#! /bin/bash
hostname.local2
######

[user@cluster user] $ bpsh 1 ./test.sh
./test.sh: line 2: hostname.local2: command not found

Since the compute node does not have hostname.local2 locally, the shell attempts to resolve the binary by asking for the binary from the master. The problem is that the master has no idea which binary to give back to the node, hence the failure. Because there is no way for BProc to know which binaries may be needed by the shell, hostname.local2 is not migrated along with the shell during the initial startup. Therefore, it is important to provide the compute node with a full path to the binary:

######
#! /bin/bash
/tmp/hostname.local2
######


[user@cluster user] $ cp /bin/hostname /tmp/hostname.local2
[user@cluster user] $ bpsh 1 ./test.sh
n1

With a full path to the binary, the compute node can construct a proper request for the master, and the master knows which exact binary to return to the compute node for proper execution.

Example 3-5. Process Data Files

Files that are opened by a process (including files on disk, sockets, or named pipes) are not automatically migrated to compute nodes. Suppose the application BOB needs the data file 1.dat:

[user@cluster user] $ bpsh 1 /usr/local/BOB/bin/BOB 1.dat

1.dat must be either pre-staged to the compute node, e.g., using bpcp to copy it there, or else the data files must be accessible on an NFS-mounted file system. The file /etc/beowulf/fstab (or a node-specific fstab.nodeNumber) specifies which filesystems are NFS-mounted on each compute node by default.
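As a sketch of the pre-staging approach (using the hypothetical BOB application and data file from this example), the data file can be copied out with bpcp and then referenced by its staged path:

[user@cluster user] $ bpcp 1.dat 1:/tmp/1.dat
[user@cluster user] $ bpsh 1 /usr/local/BOB/bin/BOB /tmp/1.dat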

Example 3-6. Installing Commercial Applications

Through the course of its execution, the application BOB in the example above does some work with the data file 1.dat, and then later attempts to call /usr/local/BOB/bin/BOB.helper.bin and /usr/local/BOB/bin/BOB.cleanup.bin. If these binaries are not in the memory space of the process during migration, the calls to these binaries will fail. Therefore, /usr/local/BOB should be NFS-mounted to all of the compute nodes, or the binaries should be pre-staged using bpcp to copy them by hand to the compute nodes. The binaries will stay on each compute node until that node is rebooted. Generally for commercial applications, the administrator should have $APP_HOME NFS-mounted on the compute nodes that will be involved in execution. A general best practice is to mount a general directory such as /opt, and install all of the applications into /opt.
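Before launching jobs, a quick way to confirm that an application tree is actually visible on the compute nodes is to list it with bpsh; the path below follows the hypothetical BOB example above:

[user@cluster user] $ bpsh -a -p ls /usr/local/BOB/bin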

Environment Modules

The ClusterWare env-modules package provides for the dynamic modification of a user’s environment via modulefiles. Each modulefile contains the information needed to configure the shell for an application, allowing a user to easily switch between applications with a simple module switch command that resets environment variables like PATH and LD_LIBRARY_PATH.

A number of modules that configure application builds and execution with OpenMPI are already installed. See the Section called Pre-Requisites to Running OpenMPI. For more information about creating your own modules, see http://modules.sourceforge.net, or view the manpages man module and man modulefile.
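A minimal session might look like the following; the module names are placeholders, since the modules actually installed depend on your cluster (run module avail to see the real list):

[user@cluster user] $ module avail
[user@cluster user] $ module load openmpi/gnu
[user@cluster user] $ module list
[user@cluster user] $ module switch openmpi/gnu openmpi/intel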

Running Programs That Are Not Parallelized

Starting and Migrating Programs to Compute Nodes (bpsh)

There are no executable programs (binaries) on the file system of the compute nodes. This means that there is no getty, no login, and no shells on the compute nodes. Instead of the remote shell (rsh) and secure shell (ssh) commands that are available on networked stand-alone computers (each of which has its own collection of binaries), Scyld ClusterWare has the bpsh command. The following example shows the standard ls command running on node 2 using bpsh:

[user@cluster user] $ bpsh 2 ls -FC /
bin/   dev/   home/  lib64/  proc/  sys/  usr/
bpfs/  etc/   lib/   opt/    sbin/  tmp/  var/

At startup time, by default Scyld ClusterWare exports various directories on the master node, e.g., /bin and /usr/bin, and those directories are NFS-mounted by compute nodes. However, an NFS-accessible /bin/ls is not a requirement for bpsh 2 ls to work.

Note that the /sbin directory also exists on the compute node. It is not exported by the master node by default, and thus it exists locally on a compute node in the RAM-based filesystem; bpsh 2 ls /sbin usually shows an empty directory. Nonetheless, bpsh 2 modprobe bproc executes successfully, even though which modprobe shows that the command resides in /sbin/modprobe, and bpsh 2 which modprobe fails to find the command on the compute node because its /sbin does not contain modprobe.

bpsh 2 modprobe bproc works because bpsh initiates a modprobe process on the master node, then forms a process memory image that includes the command's binary and references to all its dynamically linked libraries. This process memory image is then copied (migrated) to the compute node, and there the references to dynamic libraries are remapped in the process address space. Only then does the modprobe command begin real execution.

bpsh is not a special version of sh, but a special way of handling execution; this process works with any program. Be aware of the following:

• All three standard I/O streams — stdin, stdout, and stderr — are forwarded to the master node. Since some programs need to read standard input and will stop working if they are run in the background, be sure to close standard input at invocation by using the bpsh -n flag when you run a program in the background on a compute node; see the example following this list.

• Because shell scripts expect executables to be present, and compute nodes do not meet this requirement, shell scripts should run on the master node and be modified to include the bpsh commands needed to affect the compute nodes.

• The dynamic libraries are cached separately from the process memory image, and are copied to the compute node only if they are not already there. This saves time and network bandwidth. After the process completes, the dynamic libraries are unloaded from memory, but they remain in the local cache on the compute node, so they won’t need to be copied if needed again.
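For instance, to run a hypothetical long-running program in the background on node 3 with standard input closed (the program and output file names are illustrative):

[user@cluster user] $ bpsh -n 3 ./long_job > long_job.out 2>&1 &

The output redirection happens on the master node, since stdout and stderr are forwarded there.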

For additional information on the BProc Distributed Process Space and how processes are migrated to compute nodes, see the Administrator’s Guide.

Copying Information to Compute Nodes (bpcp)

Just as traditional Unix has copy (cp), remote copy (rcp), and secure copy (scp) to move files to and from networked machines, Scyld ClusterWare has the bpcp command.


Although the default sharing of the master node's home directories via NFS is useful for sharing small files, it is not a good solution for large data files. Having the compute nodes read large data files served via NFS from the master node will result in major network congestion, or even an overload and shutdown of the NFS server. In these cases, staging data files on compute nodes using the bpcp command is an alternate solution. Other solutions include using dedicated NFS servers or NAS appliances, and using cluster file systems.

Following are some examples of using bpcp. This example shows the use of bpcp to copy a data file named foo2.dat from the current directory to the /tmp directory on node 6:

[user@cluster user] $ bpcp foo2.dat 6:/tmp

The default directory on the compute node is the current directory on the master node. The current directory on the compute node may already be NFS-mounted from the master node, but it may not exist. The example above works because /tmp exists on the compute node, but it will fail if the destination does not exist. To avoid this problem, you can create the necessary destination directory on the compute node before copying the file, as shown in the next example.

In this example, we change to the /tmp/foo directory on the master, use bpsh to create the same directory on node 6, then copy foo2.dat to the node:

[user@cluster user] $ cd /tmp/foo
[user@cluster user] $ bpsh 6 mkdir /tmp/foo
[user@cluster user] $ bpcp foo2.dat 6:

This example copies foo2.dat from node 2 to node 3 directly, without the data being stored on the master node. As in the first example, this works because /tmp exists:

[user@cluster user] $ bpcp 2:/tmp/foo2.dat 3:/tmp

Running Parallel Programs

An Introduction to Parallel Programming APIs

Programmers are generally familiar with serial, or sequential, programs. Simple programs — like "Hello World" and the basic suite of searching and sorting programs — are typical of sequential programs. They have a beginning, an execution sequence, and an end; at any time during the run, the program is executing only at a single point.

A thread is similar to a sequential program, in that it also has a beginning, an execution sequence, and an end. At any time while a thread is running, there is a single point of execution. A thread differs in that it isn't a stand-alone program; it runs within a program. The concept of threads becomes important when a program has multiple threads running at the same time and performing different tasks.

To run in parallel means that more than one thread of execution is running at the same time, often on different processors of one computer; in the case of a cluster, the threads are running on different computers. A few things are required to make parallelism work and be useful: the program must migrate to another computer or computers and get started, and at some point the data upon which the program is working must be exchanged between the processes.

The simplest case is when the same single-process program is run with different input parameters on all the nodes, and the results are gathered at the end of the run. Using a cluster to get faster results of the same non-parallel program with different inputs is called parametric execution.


A much more complicated example is a simulation, where each process represents some number of elements in the system. Every few time steps, all the elements need to exchange data across boundaries to synchronize the simulation. This situation requires a message passing interface or MPI. To solve these two problems — program startup and message passing — you can develop your own code using POSIX interfaces. Alternatively, you could utilize an existing parallel application programming interface (API), such as the Message Passing Interface (MPI) or the Parallel Virtual Machine (PVM). These are discussed in the sections that follow.

MPI

The Message Passing Interface (MPI) application programming interface is currently the most popular choice for writing parallel programs. The MPI standard leaves implementation details to the system vendors (like Scyld). This is useful because they can make appropriate implementation choices without adversely affecting the output of the program.

A program that uses MPI is automatically started a number of times and is allowed to ask two questions: How many of us (size) are there, and which one am I (rank)? Then a number of conditionals are evaluated to determine the actions of each process. Messages may be sent and received between processes. The advantages of MPI are that the programmer:

• Doesn’t have to worry about how the program gets started on all the machines

• Has a simplified interface for inter-process messages

• Doesn’t have to worry about mapping processes to nodes

• Abstracts the network details, resulting in more portable hardware-agnostic software

Also see the section on running MPI-aware programs later in this chapter. Scyld ClusterWare includes several implementations of MPI, including the following:

MPICH

Scyld ClusterWare includes MPICH, a freely-available implementation of the MPI standard. MPICH is a project managed by Argonne National Laboratory and Mississippi State University. For more information on MPICH, visit the MPICH web site at http://www-Unix.mcs.anl.gov/mpi/mpich/. Scyld MPICH is modified to use BProc and Scyld job mapping support; see the section on job mapping later in this chapter.

MVAPICH

MVAPICH is an implementation of MPICH for Infiniband interconnects. As part of our Infiniband product, Scyld ClusterWare provides a version of MVAPICH, modified to use BProc and Scyld job mapping support; see the section on job mapping later in this chapter. For information on MVAPICH, visit the MVAPICH web site at http://nowlab.cse.ohio-state.edu/projects/mpi-iba/.

OpenMPI

OpenMPI is an open-source implementation of the Message Passing Interface 2 (MPI-2) specification. The OpenMPI implementation is an optimized combination of several other MPI implementations, and is likely to perform better than MPICH. For more information on OpenMPI, visit the OpenMPI web site at http://www.open-mpi.org/. Also see the section on running OpenMPI programs later in this chapter.


Other MPI Implementations

Various commercial MPI implementations run on Scyld ClusterWare. See the Scyld MasterLink site at http://www.penguincomputing.com/support for more information. You can also download and build your own version of MPI, and configure it to run on Scyld ClusterWare.

PVM

Parallel Virtual Machine (PVM) was an earlier parallel programming interface. Unlike MPI, it is not a specification but a single set of source code distributed on the Internet. PVM reveals much more about the details of starting your job on remote nodes; however, it fails to abstract implementation details as well as MPI does.

PVM is deprecated, but is still in use by legacy code. We generally advise against writing new programs in PVM, but some of the unique features of PVM may suggest its use. Also see the section on running PVM-aware programs later in this chapter.

Custom APIs

As mentioned earlier, you can develop your own parallel API by using various Unix and TCP/IP standards. In terms of starting a remote program, there are programs written:

• Using the rexec function call

• To use the rexec or rsh program to invoke a sub-program

• To use Remote Procedure Call (RPC)

• To invoke another sub-program using the inetd super server

These solutions come with their own problems, particularly in the implementation details. What are the network addresses? What is the path to the program? What is the account name on each of the computers? How is one going to load-balance the cluster? Scyld ClusterWare, which doesn't have binaries installed on the cluster nodes, may not lend itself to these techniques. We recommend that you write your parallel code in MPI. That said, Scyld has some experience getting rexec() calls to work, and calls to rsh can simply be replaced with the more cluster-friendly bpsh.

Mapping Jobs to Compute Nodes

Running programs specifically designed to execute in parallel across a cluster requires at least knowledge of the number of processes to be used. Scyld ClusterWare uses the NP environment variable to determine this. The following example will use 4 processes to run an MPI-aware program called a.out, which is located in the current directory.

[user@cluster user] $ NP=4 ./a.out

Note that each kind of shell has its own syntax for setting environment variables; the example above uses the syntax of the Bourne shell (/bin/sh or /bin/bash).

What the example above does not specify is which specific nodes will execute the processes; this is the job of the mapper. Mapping determines which node will execute each process. While this seems simple, it can get complex as various requirements are added. The mapper scans available resources at the time of job submission to decide which processors to use. Scyld ClusterWare includes beomap, a mapping API (documented in the Programmer's Guide, with details for writing your own mapper). The mapper's default behavior is controlled by the following environment variables:

• NP — The number of processes requested, but not the number of processors. As in the example earlier in this section, NP=4 ./a.out will run the MPI program a.out with 4 processes.

• ALL_CPUS — Set the number of processes to the number of CPUs available to the current user. Similar to the example above, ALL_CPUS=1 ./a.out would run the MPI program a.out on all available CPUs.

• ALL_NODES — Set the number of processes to the number of nodes available to the current user. Similar to the ALL_CPUS variable, but you get a maximum of one CPU per node. This is useful for running a job per node instead of per CPU.

• ALL_LOCAL — Run every process on the master node; used for debugging purposes.

• NO_LOCAL — Don’t run any processes on the master node.

• EXCLUDE — A colon-delimited list of nodes to be avoided during node assignment.

• BEOWULF_JOB_MAP — A colon-delimited list of nodes. The first node listed will run the first process (MPI Rank 0), and so on.

You can use the beomap program to display the current mapping for the current user in the current environment with the current resources at the current time. See the Reference Guide for a detailed description of beomap and its options, as well as examples for using it.
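As a quick, hedged illustration (the output shown is illustrative), you can preview the mapping that would be used for an 8-process job that excludes the master node:

[user@cluster user] $ NP=8 NO_LOCAL=1 beomap
1:1:2:2:3:3:4:4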

Running MPICH Programs

MPI-aware programs are those written to the MPI specification and linked with the Scyld MPICH library. This section discusses how to run MPI-aware programs and how to set mapping parameters from within such programs. Examples of running MPI-aware programs are also provided. For information on building MPI programs, see the Programmer's Guide.

mpirun

Almost all implementations of MPI have an mpirun program, which shares the syntax of mpprun but offers additional features for MPI-aware programs. In the Scyld implementation of mpirun, all of the options available via environment variables or flags through directed execution are available as flags to mpirun, and can be used with properly compiled MPI jobs. For example, the command for running a hypothetical program named my-mpi-prog with 16 processes:

[user@cluster user] $ mpirun -np 16 my-mpi-prog arg1 arg2

is equivalent to running the following commands in the Bourne shell:

[user@cluster user] $ export NP=16
[user@cluster user] $ my-mpi-prog arg1 arg2


Setting Mapping Parameters from Within a Program

A program can be designed to set all the required parameters itself. This makes it possible to create programs in which the parallel execution is completely transparent. However, note that this will work only with Scyld ClusterWare, while the rest of your MPI program should work on any MPI platform.

Use of this feature differs from the command line approach in that all options that would otherwise be set on the command line can be set from within the program. This feature may be used only with programs specifically designed to take advantage of it, rather than with any arbitrary MPI program. However, it makes it possible to produce turn-key applications and parallel library functions in which the parallelism is completely hidden.

Following is a brief example of the necessary source code to invoke mpirun with the -np 16 option from within a program, to run the program with 16 processes:

/* Standard MPI include file */
#include <mpi.h>
#include <stdlib.h>    /* for setenv() */

int main(int argc, char **argv)
{
    setenv("NP", "16", 1);      /* set up mpirun env vars before MPI_Init */
    MPI_Init(&argc, &argv);
    MPI_Finalize();
    return 0;
}

More details for setting mapping parameters within a program are provided in the Programmer’s Guide.

Examples

The examples in this section illustrate certain aspects of running a hypothetical MPI-aware program named my-mpi-prog.

Example 3-7. Specifying the Number of Processes

This example shows a cluster execution of a hypothetical program named my-mpi-prog run with 4 processes:

[user@cluster user] $ NP=4 ./my-mpi-prog

An alternative syntax is as follows:

[user@cluster user] $ NP=4
[user@cluster user] $ export NP
[user@cluster user] $ ./my-mpi-prog

Note that the user specified neither the nodes to be used nor a mechanism for migrating the program to the nodes. The mapper handles these tasks, and jobs are run on the nodes with the lowest CPU utilization.

Example 3-8. Excluding Specific Resources

In addition to specifying the number of processes to create, you can also exclude specific nodes as computing resources. In this example, we run my-mpi-prog again, but this time we not only specify the number of processes to be used (NP=6), but we also exclude the master node (NO_LOCAL=1) and some cluster nodes (EXCLUDE=2:4:5) as computing resources.

[user@cluster user] $ NP=6 NO_LOCAL=1 EXCLUDE=2:4:5 ./my-mpi-prog


Running OpenMPI Programs

OpenMPI programs are those written to the MPI-2 specification. This section provides information needed to use programs with OpenMPI as implemented in Scyld ClusterWare.

Pre-Requisites to Running OpenMPI

A number of commands, such as mpirun, are duplicated between OpenMPI and other MPI implementations. Scyld ClusterWare provides the env-modules package, which gives users a convenient way to switch between the various implementations. Be sure to load an OpenMPI module to favor the OpenMPI commands, located in /usr/openmpi/, over the MPICH commands, which are located in /usr/.

Each module bundles together various compiler-specific environment variables to configure your shell for building and running your application, and for accessing compiler-specific manpages. Be sure to load the module that matches the compiler that built the application you wish to run. For example, to load the OpenMPI module for use with the Intel compiler, do the following:

[user@cluster user] $ module load openmpi/intel

Currently, there are modules for the GNU, Intel, and PGI compilers. To see a list of all of the available modules:

[user@cluster user] $ module avail openmpi
--------------------------- /opt/modulefiles ---------------------------
openmpi/gnu(default)    openmpi/intel           openmpi/pgi

For more information about creating your own modules, see http://modules.sourceforge.net and the manpages man module and man modulefile.

Using OpenMPI

Unlike the Scyld ClusterWare MPICH implementation, OpenMPI does not honor the Scyld Beowulf job mapping environment variables. You must specify the list of hosts either on the command line or inside a hostfile. To specify the list of hosts on the command line, use the -H option. The argument following -H is a comma-separated list of hostnames, not node numbers. For example, to run a two-process job, with one process running on node 0 and one on node 1:

[user@cluster user] $ mpirun -H n0,n1 -np 2 ./mpiprog

Support for running jobs over Infiniband using the OpenIB transport is included with OpenMPI distributed with Scyld ClusterWare. Much like running a job with MPICH over Infiniband, one must specifically request the use of OpenIB. For example:

[user@cluster user] $ mpirun --mca btl openib,sm,self -H n0,n1 -np 2 ./myprog

Read the OpenMPI mpirun man page for more information about using a hostfile and about other tunable options available through mpirun.
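A hedged sketch of the hostfile approach (the hostfile name and slot counts are illustrative; consult the mpirun man page for the exact hostfile syntax your OpenMPI version supports):

[user@cluster user] $ cat myhosts
n0 slots=2
n1 slots=2
[user@cluster user] $ mpirun --hostfile myhosts -np 4 ./mpiprog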

Running PVM-Aware Programs

Parallel Virtual Machine (PVM) is an application programming interface for writing parallel applications, enabling a collection of heterogeneous computers to be used as a coherent and flexible concurrent computational resource. Scyld has developed the Scyld PVM library, specifically tailored to allow PVM to take advantage of the technologies used in Scyld ClusterWare. A PVM-aware program is one that has been written to the PVM specification and linked against the Scyld PVM library.


A complete discussion of cluster configuration for PVM is beyond the scope of this document. However, a brief introduction is provided here, with the assumption that the reader has some background knowledge on using PVM. You can start the master PVM daemon on the master node using the PVM console, pvm. To add a compute node to the virtual machine, issue an add .# command, where # is replaced by a node’s assigned number in the cluster.

Tip: You can generate a list of node numbers using either the beosetup or bpstat command.

Alternately, you can start the PVM console with a hostfile filename on the command line. The hostfile should contain a .# for each compute node you want as part of the virtual machine. As with standard PVM, this method automatically spawns PVM slave daemons to the specified compute nodes in the cluster. From within the PVM console, use the conf command to list your virtual machine’s configuration; the output will include a separate line for each node being used. Once your virtual machine has been configured, you can run your PVM applications as you normally would.
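For example, a brief, hedged console session might look like the following (the node numbers are illustrative):

[user@cluster user] $ pvm
pvm> add .0 .1
pvm> conf
pvm> quit

Note that quit exits the console but leaves the PVM daemons running, so the virtual machine remains available to your PVM applications.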

Porting Other Parallelized Programs

Programs written for use on other types of clusters may require various levels of change to function with Scyld ClusterWare. For instance:

• Scripts or programs that invoke rsh can instead call bpsh.

• Scripts or programs that invoke rcp can instead call bpcp.

• beomap can be used with any script to load balance programs that are to be dispatched to the compute nodes.

For more information on porting applications, see the Programmer's Guide.

Running Serial Programs in Parallel

For jobs that are not "MPI-aware" or "PVM-aware", but need to be started in parallel, Scyld ClusterWare provides the parallel execution utilities mpprun and beorun. These utilities are more sophisticated than bpsh, in that they can automatically select ranges of nodes on which to start your program, run tasks on the master node, determine the number of CPUs on a node, and start a copy on each CPU. Thus, mpprun and beorun provide true "dynamic execution" capabilities, whereas bpsh provides "directed execution" only.

mpprun and beorun are very similar, and have similar parameters. They differ only in that mpprun runs jobs sequentially on the selected processors, while beorun runs jobs concurrently on the selected processors.

mpprun

mpprun is intended for applications rather than utilities, and runs them sequentially on the selected nodes. The basic syntax of mpprun is as follows:

[user@cluster user] $ mpprun [options] app arg1 arg2...

where app is the application program you wish to run; it need not be a parallel program. The arg arguments are the values passed to each copy of the program being run.


Options

mpprun includes options for controlling various aspects of the job, including the ability to:

• Specify the number of processors on which to start copies of the program

• Start one copy on each node in the cluster

• Start one copy on each CPU in the cluster

• Force all jobs to run on the master node

• Prevent any jobs from running on the master node

The most interesting of the options is the --map option, which lets the user specify which nodes will run copies of a program; an example is provided in the next section. This argument, if specified, overrides the mapper's selection of resources that it would otherwise use. See the Reference Guide for a complete list of options for mpprun.

Examples

Run 16 tasks of program app:

[user@cluster user] $ mpprun -np 16 app infile outfile

Run 16 tasks of program app on any available nodes except nodes 2 and 3:

[user@cluster user] $ mpprun -np 16 --exclude 2:3 app infile outfile

Run 4 tasks of program app with task 0 on node 4, task 1 on node 2, task 2 on node 1, and task 3 on node 5:

[user@cluster user] $ mpprun --map 4:2:1:5 app infile outfile

beorun

beorun is intended for applications rather than utilities, and runs them concurrently on the selected nodes. The basic syntax of beorun is as follows:

[user@cluster user] $ beorun [options] app arg1 arg2...

where app is the application program you wish to run; it need not be a parallel program. The arg arguments are the values passed to each copy of the program being run.

Options

beorun includes options for controlling various aspects of the job, including the ability to:

• Specify the number of processors on which to start copies of the program

• Start one copy on each node in the cluster

• Start one copy on each CPU in the cluster

• Force all jobs to run on the master node


• Prevent any jobs from running on the master node

The most interesting of the options is the --map option, which lets the user specify which nodes will run copies of a program; an example is provided in the next section. This argument, if specified, overrides the mapper's selection of resources that it would otherwise use. See the Reference Guide for a complete list of options for beorun.

Examples

Run 16 tasks of program app:

[user@cluster user] $ beorun -np 16 app infile outfile

Run 16 tasks of program app on any available nodes except nodes 2 and 3:

[user@cluster user] $ beorun -np 16 --exclude 2:3 app infile outfile

Run 4 tasks of program app with task 0 on node 4, task 1 on node 2, task 2 on node 1, and task 3 on node 5:

[user@cluster user] $ beorun --map 4:2:1:5 app infile outfile

Job Batching

Job Batching Options for ClusterWare

For Scyld ClusterWare HPC, the default installation includes the TORQUE resource manager, providing users an intuitive interface for remotely initiating and managing batch jobs on distributed compute nodes. TORQUE is an open source tool based on standard OpenPBS. Basic instructions for using TORQUE are provided in the next section. For more general product information, see the TORQUE information page sponsored by Cluster Resources, Inc. (CRI). (Note that TORQUE is not included in the default installation of Scyld Beowulf Series 30.)

Scyld also offers the Scyld TaskMaster Suite for clusters running Scyld Beowulf Series 30, Scyld ClusterWare HPC, and upgrades to these products. TaskMaster is a Scyld-branded and supported commercial scheduler and resource manager, developed jointly with Cluster Resources. For information on TaskMaster, see the Scyld TaskMaster Suite page in the HPC Clustering area of the Penguin web site, or contact Scyld Customer Support.

In addition, Scyld provides support for most popular open source and commercial schedulers and resource managers, including SGE, LSF, PBSPro, Maui, and MOAB. For the latest information, see the Scyld MasterLink™ support site.

Job Batching with TORQUE

The default installation is configured as a simple job serializer with a single queue named batch. You can use the TORQUE resource manager to run jobs, check job status, find out which nodes are running your job, and find job output.


Running a Job

To run a job with TORQUE, you can put the commands you would normally use into a job script, and then submit the job script to the cluster using qsub. The qsub program has a number of options that may be supplied on the command line or as special directives inside the job script. For the most part, these options behave the same whether supplied in a job script or on the command line, but job scripts make it easier to manage your actions and their results.

Following are some examples of running a job using qsub. For more detailed information on qsub, see the qsub man page.

Example 3-9. Starting a Job with a Job Script Using One Node

The following script declares a job with the name "myjob", to be run using one node. The script uses the PBS -N directive, launches the job, and finally sends the current date and working directory to standard output.

#!/bin/sh

## Set the job name
#PBS -N myjob
#PBS -l nodes=1

# Run my job
/path/to/myjob

echo Date: $(date)
echo Dir: $PWD

You would submit "myjob" as follows:

[bjosh@iceberg]$ qsub -l nodes=1 myjob
15.iceberg

Example 3-10. Starting a Job from the Command Line

This example provides the command line equivalent of the job run in the example above. We enter all of the qsub options on the initial command line. Then qsub reads the job commands line-by-line until we type ^D, the end-of-file character. At that point, qsub queues the job and returns the Job ID.

[bjosh@iceberg]$ qsub -N myjob -l nodes=1:ppn=1 -j oe
cd $PBS_O_WORKDIR
echo Date: $(date)
echo Dir: $PWD
^D
16.iceberg

Example 3-11. Starting an MPI Job with a Job Script

The following script declares an MPI job named "mpijob". The script uses the PBS -N directive, prints out the nodes that will run the job, launches the job using mpirun, and finally prints out the current date and working directory. When submitting MPI jobs using TORQUE, it is recommended to simply call mpirun without any arguments. mpirun will detect that it is being launched from within TORQUE and will ensure that the job is properly started on the nodes TORQUE has assigned to it. In this case, TORQUE will properly manage and track the resources used by the job.

## Set the job name
#PBS -N mpijob

# RUN my job
mpirun /path/to/mpijob

echo Date: $(date)
echo Dir: $PWD

To request 8 total processors to run "mpijob", you would submit the job as follows:

[bjosh@iceberg]$ qsub -l nodes=8 mpijob
17.iceberg

To request 8 total processors, using 4 nodes, each with 2 processors per node, you would submit the job as follows:

[bjosh@iceberg]$ qsub -l nodes=4:ppn=2 mpijob
18.iceberg

Checking Job Status

You can check the status of your job using qstat. The command line option qstat -n will display the status of queued jobs. To watch the progression of events, use the watch command to execute qstat -n every 2 seconds (the default interval); type [CTRL]-C to interrupt watch when needed.

Example 3-12. Checking Job Status

This example shows how to check the status of the job named "myjob", which we ran on 1 node in the first example above, using the option to watch the progression of events.

[bjosh@iceberg]$ qsub myjob && watch qstat -n
iceberg:
JobID        Username  Queue    Jobname  SessID  NDS  TSK  ReqdMemory  ReqdTime  S  ElapTime
15.iceberg   bjosh     default  myjob    --      1    --   --          00:01     Q  --

Table 3-1. Useful Job Status Commands

Command                 Purpose
ps -ef | bpstat -P      Display all running jobs, with node number for each
qstat -Q                Display status of all queues
qstat -n                Display status of queued jobs
qstat -f JOBID          Display very detailed information about Job ID
pbsnodes -a             Display status of all nodes

Finding Out Which Nodes Are Running a Job

To find out which nodes are running your job, use the following commands:

• To find your Job IDs: qstat -an

• To find the Process IDs of your jobs: qstat -f

• To find the number of the node running your job: ps -ef | bpstat -P | grep <PID>

The number of the node running your job will be displayed in the first column of output.
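For example, assuming illustrative Job ID and session ID values:

[bjosh@iceberg]$ qstat -f 15.iceberg | grep session_id
    session_id = 12345
[bjosh@iceberg]$ ps -ef | bpstat -P | grep 12345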


Finding Job Output

When your job terminates, TORQUE will store its output and error streams in files in the script's working directory.

• Default output file: <jobname>.o<job number>

You can override the default using qsub with the -o option on the command line, or use the #PBS -o directive in your job script.

• Default error file: <jobname>.e<job number>

You can override the default using qsub with the -e option on the command line, or use the #PBS -e directive in your job script.

• To join the output and error streams into a single file, use qsub with the -j oe option on the command line, or use the #PBS -j oe directive in your job script.
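For example, the following directives near the top of a job script would name the job, join the two streams, and write them to a single file (the path shown is illustrative):

#PBS -N myjob
#PBS -j oe
#PBS -o /home/bjosh/myjob.log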

File Systems

Data files used by the applications processed on the cluster may be stored in a variety of locations, including:

• On the local disk of each node

• On the master node’s disk, shared with the nodes through a network file system

• On disks on multiple nodes, shared with all nodes through the use of a parallel file system

The simplest approach is to store all files on the master node, as with the standard Network File System. Any files in your /home directory are shared via NFS with all the nodes in your cluster. This makes management of the files very simple, but in larger clusters the performance of NFS on the master node can become a bottleneck for I/O-intensive applications. If you are planning a large cluster, you should include disk drives that are separate from the system disk to contain your shared files; for example, place /home on a separate pair of RAID1 disks in the master node. A more scalable solution is to utilize a dedicated NFS server with a properly configured storage system for all shared files and programs, or a high performance NAS appliance.

Storing files on the local disk of each node removes the performance problem, but makes it difficult to share data between tasks on different nodes. Input files for programs must be distributed manually to each of the nodes, and output files from the nodes must be manually collected back on the master node. This mode of operation can still be useful for temporary files created by a process and then later reused on that same node.
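If you are unsure what a given compute node actually has mounted, bpsh provides a quick check; for example (the node number is illustrative):

[user@cluster user] $ bpsh 4 mount
[user@cluster user] $ bpsh 4 df -h /home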

Sample Programs Included with Scyld ClusterWare

linpack

The Linpack benchmark suite, used to evaluate computer performance, stresses a cluster by solving a random dense linear system, maximizing your CPU and network usage. Administrators use Linpack to evaluate cluster fitness. For information on Linpack, see the Top 500 page at http://www.top500.org/.


The linpack shell script provided with Scyld ClusterWare is a portable, non-optimized version of the High Performance Linpack (HPL) benchmark. It is intended for verification purposes only, and the results should not be used for performance characterization. Running the linpack shell script starts xhpl after creating a configuration/input file. If linpack doesn’t run to completion or takes too long to run, check for network problems, such as a bad switch or incorrect switch configuration.

Tip: The linpack default settings are too general to result in good performance on clusters larger than a few nodes; consult the file /usr/share/doc/hpl-1.0/TUNING for tuning tips appropriate to your cluster. A first step is to increase the problem size, set around line 15 to a default value of 3000. If this value is set too high, it will cause failure by memory starvation.

The following figure illustrates example output from linpack.

Figure 3-1. Testing Your Cluster with linpack

mpi-mandel

mpi-mandel is a graphical, interactive demonstration of the Mandelbrot set. It is a classic MPI program, and therefore gives us a chance to use our mapping environment variables; the following example invocation will run on all available CPUs:

[user@cluster user] $ ALL_CPUS=1 /usr/bin/mpi-mandel

In the example below, the compute node (slave) count is displayed in the bottom-left corner. When run on x86 CPUs with a kernel that has performance counter support enabled, mpi-mandel shows the number of integer and floating point calculations in the status bar at the bottom of the window.


Figure 3-2. Demonstrating Your Cluster with mpi-mandel

To have mpi-mandel run as a free-flowing demonstration, you can load a favorites file, for example:

[user@cluster user] $ NP=4 mpi-mandel --demo /usr/share/doc/mpi-mandel-1.0.20a/mandel.fav

The output display will refresh more quickly if you turn off Favorites > delay. Consider restoring the delay if you are running a very non-homogeneous cluster, or if your video card isn’t fast enough, as either condition may result in a partial screen refresh.

Notes

1. http://modules.sourceforge.net
2. http://www-Unix.mcs.anl.gov/mpi/mpich/
3. http://nowlab.cse.ohio-state.edu/projects/mpi-iba/
4. http://www.open-mpi.org/
5. http://www.penguincomputing.com/support
6. http://modules.sourceforge.net
7. http://www.clusterresources.com/pages/products/torque-resource-manager.php
8. http://www.penguincomputing.com


9. http://www.penguincomputing.com/support
10. http://www.top500.org/

Appendix A. Glossary of Parallel Computing Terms

Bandwidth
A measure of the total amount of information delivered by a network. This metric is typically expressed in millions of bits per second (Mbps) for data rate on the physical communication media or megabytes per second (MBps) for the performance seen by the application.

Backplane Bandwidth
The total amount of data that a switch can move through it in a given time, typically much higher than the bandwidth delivered to a single node.

Bisection Bandwidth
The amount of data that can be delivered from one half of a network to the other half in a given time, through the least favorable halving of the network fabric.

Boot Image
The file system and kernel seen by a compute node at boot time; contains enough drivers and information to get the system up and running on the network.

Cluster
A collection of nodes, usually dedicated to a single purpose.

Compute Node
Nodes attached to the master through an interconnection network, used as dedicated attached processors. With Scyld, users never need to directly log into compute nodes.

Data Parallel
A style of programming in which copies of a single program run on multiple nodes, with each copy performing the same instructions while operating on different data.

Efficiency
The ratio of a program's actual speed-up to its theoretical maximum.

FLOPS
Floating-point operations per second, a key measure of performance for many scientific and numerical applications.

Grain Size, Granularity
A measure of the amount of computation a node can perform in a given problem between communications with other nodes, typically described as "coarse" (large amount of computation) or "fine" (small amount of computation). Granularity is a key factor in determining the performance of a particular process on a particular cluster.

High Availability
Refers to a level of reliability; usually implies some level of fault tolerance (the ability to operate in the presence of a hardware failure).


Hub
A device for connecting the NICs in an interconnection network. Only one pair of ports (a bus) can be active at any time. Modern interconnections utilize switches, not hubs.

Isoefficiency
The ability of a process to maintain a constant efficiency if the size of the process scales with the size of the machine.

Jobs
In traditional computing, a job is a single task. A parallel job can be a collection of tasks, all working on the same problem but running on different nodes.

Kernel
The core of the operating system, the kernel is responsible for processing all system calls and managing the system's physical resources.

Latency
The length of time from when a bit is sent across the network until the same bit is received. Can be measured for just the network hardware (wire latency) or application-to-application (includes software overhead).

Local Area Network (LAN)
An interconnection scheme designed for short physical distances and high bandwidth, usually self-contained behind a single router.

MAC Address
On an Ethernet NIC, the hardware address of the card. MAC addresses are unique to the specific NIC, and are useful for identifying specific nodes.

Master Node
Node responsible for interacting with users, connected to both the public network and interconnection network. The master node controls the compute nodes.

Message Passing
Exchanging information between processes, frequently on separate nodes.

Middleware
A layer of software between the user's application and the operating system.

MPI
The Message Passing Interface, the standard for producing message passing libraries.

MPICH
A commonly used MPI implementation, built on the chameleon communications layer.


Network Interface Card (NIC)
The device through which a node connects to the interconnection network. The performance of the NIC and the network it attaches to limit the amount of communication that can be done by a parallel program.

Node
A single computer system (motherboard, one or more processors, memory, possibly a disk, network interface).

Parallel Programming
The art of writing programs that are capable of being executed on many processors simultaneously.

Process
An instance of a running program.

Process Migration
Moving a process from one computer to another after the process begins execution.

PVM
The Parallel Virtual Machine, a common message passing library that predates MPI.

Scalability
The ability of a process to maintain efficiency as the number of processors in the parallel machine increases.

Single System Image
All nodes in the system see identical system files, including the same kernel, libraries, header files, etc. This guarantees that a program that will run on one node will run on all nodes.

Socket
A low-level construct for creating a connection between processes on a remote system.

Speedup
A measure of the improvement in the execution time of a program on a parallel computer vs. a serial computer.

Switch
A device for connecting the NICs in an interconnection network so that all pairs of ports can communicate simultaneously.

Version Skew
The problem of having more than one version of software or files (kernel, tools, shared libraries, header files) on different nodes.

