Beowulf Clusters

K. Cooper

Department of Mathematics, Washington State University

2019

Using Multiple Nodes

- Supercomputers are superexpensive, superproprietary
- Computers are now commodities
- Off-the-shelf machines are limited to 8-12 cores
- A machine with 192 cores is not a commodity
- 24 machines with 8 cores each is a commodity
- Hence: Beowulf clusters

Beowulf

- 1993 - Donald Becker and Thomas Sterling - NASA Goddard
- Proprietary hardware −→ proprietary programming
- Idea was to use off-the-shelf components
- Separate boxes linked by a fast network
- First cluster was 16 486s - "channel bonded" 10 Mbit Ethernet

Beowulf - 21st Century

- Changing hardware does not change the programming model
- PVM - first standard: 1995, Oak Ridge Nat’l Lab
  - Might be better for heterogeneous networks
- MPI - Message Passing Interface - first standard: 1994
  - MPICH (Argonne); LAM-MPI (Ohio); OpenMPI
  - Might be better for uniform networks

Cluster vs. Single Box

- Cluster might have more memory
- Cluster message-passing speed depends on the interconnect
- Storage should be on a head node

Cluster Requirements

- Shared file system
- Shell system
- Scheduling system
- Synchronization system

NFS

- Network File System
- The Unix way of sharing file systems
- File system on the head node can be mounted on all nodes (see the sketch below)
- All nodes can read and write the same disk, via the network
- Slow ... but faster than SMB
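
A minimal sketch of what this looks like in practice, assuming a head node named head and a private subnet (both hypothetical):

# On the head node, /etc/exports grants the compute nodes access to /home
/home 10.1.1.0/255.255.255.0(rw,sync,no_root_squash)

# Re-export after editing the file
exportfs -ra

# On each compute node, mount the shared directory over the network
mount -t nfs head:/home /home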

Shells

- Must be able to log into remote machines without a password
- Could use rsh - insecure, old-fashioned
- Could use ssh - secure, but can be a pain
- Use ssh with RSA authentication

SSH

- Set up RSA credentials: ssh-agent $SHELL
- Add key: ssh-add
- Test it: ssh compute-0-0 (full setup sketch below)
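
A minimal setup sketch using OpenSSH defaults (the node name compute-0-0 comes from the slide; everything else is standard OpenSSH):

# Generate an RSA key pair; accept the default location, choose a passphrase
ssh-keygen -t rsa

# With home directories shared over NFS, authorizing the key once is enough
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys

# Start an agent and cache the decrypted key for this session
ssh-agent $SHELL
ssh-add

# Test: this should now log in without prompting for a password
ssh compute-0-0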

Portable Batch System

- Need to schedule jobs according to available resources
- Batch: running without an interactive console
- Developed at NASA Ames
- Now there are many: OpenPBS, Torque, PBSPro

PBS Tasks

- Sort priority of requests (even NQS could do this)
- Locate and allocate sufficient processors of the requested type
- Handle I/O on multiple nodes

PBS Script

#!/bin/bash

#PBS -l nodes=4:ppn=4
#PBS -o /home/kcooper/Work/prime.out
#PBS -e /home/kcooper/Work/prime.err

/usr/local/-gcc/bin/mpirun -np 16 /home/kcooper/Work/prime2 1000000
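
Assuming the script above is saved as prime.pbs (a hypothetical name), it is submitted to the scheduler with:

# qsub prints a job identifier on success
qsub prime.pbs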

PBS Commands

-a time   : time at which to run the job
-D path   : path to the root directory for the job
-e path   : full path to the destination for standard error
-o path   : full path to the destination for standard output
-l list   : list of resources for the job
-m option : when to send an email about the job
-M users  : list of users to send mail to
-N name   : name of the job
-q queue  : which queue to submit to
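
The same options can be given on the qsub command line instead of in #PBS directives; a sketch for the script above (the paths and mail settings are illustrative):

qsub -N prime -l nodes=4:ppn=4 \
     -o /home/kcooper/Work/prime.out \
     -e /home/kcooper/Work/prime.err \
     -m abe -M kcooper prime.pbs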

SGE Script

#!/bin/bash

#$ -pe orte 12
#$ -N My_Job
#$ -cwd
#$ -j y
#$ -S /bin/bash

cd /home/directory/M583/Primes
export MPI_ROOT=/usr/lib64/openmpi/bin
$MPI_ROOT/mpirun -np 12 prime2 10000000

SGE Options

-pe   Parallel environment
-N    Job name
-cwd  Run the job in the current working directory
-j    Merge standard output and standard error streams
-o    Name of the output file
-S    Shell to run in
-l    Resource options
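
As with PBS, these can be passed to qsub directly rather than embedded in #$ lines; a hypothetical equivalent of the script above (myjob.sh is an assumed file name):

qsub -pe orte 12 -N My_Job -cwd -j y -S /bin/bash myjob.sh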

PBS/SGE Commands

qsub  - submit a job to a queue
qdel  - delete a job from the queue
qstat - check the status of the queue

−→ Example
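
A typical session might look like this (the job id is illustrative):

qsub prime.pbs    # prints a job id, e.g. 1234
qstat             # show queued and running jobs
qdel 1234         # remove job 1234 from the queue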

MPD

- MPICH - MPI implementation developed at Argonne
- CH stands for Chameleon - a portability layer
- MPICH2 - newer MPICH implementation
- Includes mpd - the Multi-Purpose Daemon, a process manager for MPICH2

MPD

- Determines which processors do what
- Run an mpd on each node you want to use
- The daemons form a "ring"
- The ring can be larger than you need
- Can use a root ring

MPD

mpd              - start one mpd daemon
mpdboot -n <num> - start a ring of <num> daemons
mpdtrace         - find out which nodes are in a ring
mpdallexit       - terminate a ring and its daemons
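
A small ring session might look like the following, assuming a file mpd.hosts listing one node name per line (the file name and counts are hypothetical):

mpdboot -n 4 -f mpd.hosts        # start a ring of 4 daemons
mpdtrace                         # confirm which nodes joined
mpiexec -n 16 ./prime2 1000000   # run an MPI job on the ring
mpdallexit                       # shut the ring down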

MPD

- To use the root ring you must have the file .mpd.conf
- It contains one line: MPD_USE_ROOT_MPD=1
- You then have little or no control over where processes run
- Could instead start your own ring
- .mpd.conf must then have a secret word in it (sketch below)
- Put the mpdboot commands in the batch file
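
A sketch of the two .mpd.conf variants (the secret word is a placeholder; the file should be readable only by you, e.g. chmod 600):

# ~/.mpd.conf - use the cluster's root ring
MPD_USE_ROOT_MPD=1

# ~/.mpd.conf - run your own ring instead
MPD_SECRETWORD=replace-with-a-secret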

Infiniband

- Becoming standard for HPC
- Forms actual connections between devices
- Direct write to memory
- Direct channel between processors
- Can be up to 600 Gb/s
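
On a node with the usual Infiniband diagnostic tools installed (an assumption about the cluster's software), the fabric can be checked with:

ibstat        # state, rate, and port counters of the local IB adapter
ibv_devinfo   # device capabilities as seen by the verbs library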

MVAPICH

- MPICH over Infiniband and other RDMA (Remote Direct Memory Access) networks
- MPICH over the VAPI layer
- VAPI is a structure developed by Mellanox
- In practice, just uses substitute libraries
- Faster communication
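
Since the change is "just substitute libraries", switching implementations is largely a matter of which wrappers are on the PATH; a hypothetical sketch (the install prefix is an assumption):

export PATH=/usr/local/mvapich2/bin:$PATH   # use the MVAPICH2 wrappers
mpicc -o prime2 prime2.c                    # recompile against its libraries
mpiexec -n 16 ./prime2 1000000              # run over Infiniband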

Monitoring

- Usually Ganglia

−→ Example