VSC documentation Documentation Release 1.0

VSC (Vlaams Supercomputing Center)

Sep 03, 2021

Contents

1 Getting access
  1.1 Required steps to get access
  1.2 VSC accounts
  1.3 How to request an account?
  1.4 Next steps
  1.5 Additional information

2 Access and data transfer
  2.1 Logging in to a cluster
  2.2 Data storage
  2.3 Transferring data
  2.4 GUI applications on the clusters
  2.5 VPN

3 Software stack
  3.1 Using the module system

4 Running jobs
  4.1 Job script
  4.2 Submitting and monitoring a job
  4.3 Job output
  4.4 Troubleshooting
  4.5 Advanced topics

5 Software development
  5.1 Programming paradigms
  5.2 Development tools
  5.3 Libraries
  5.4 Integrating code with software packages

6 VSC hardware
  6.1 Tier-2 hardware
  6.2 Tier-1 hardware

7 Globus file and data sharing platform
  7.1 What is Globus
  7.2 Access
  7.3 Managing and transferring files
  7.4 Local endpoints
  7.5 Data sharing
  7.6 Manage Globus groups
  7.7 Command Line Interface (CLI)
  7.8 Glossary

8 Frequently asked questions (FAQs)
  8.1 General questions
  8.2 Access to the infrastructure
  8.3 Running jobs
  8.4 Software

Index

CHAPTER 1

Getting access

1.1 Required steps to get access

New users of the VSC clusters should take the following three steps to get access:

1. create a public/private key pair
2. apply for a VSC account
3. login to the cluster

1.2 VSC accounts

In order to use the infrastructure of the VSC, you need a VSC user ID, also called a VSC account. Check the VSC website for the conditions. All VSC accounts start with the letters "vsc" followed by a five-digit number. The first digit indicates your home institution. There is no relationship with your name, nor is the information about the link between VSC accounts and your name publicly accessible.

Your VSC account gives you access to most of the VSC Tier-2 infrastructure, though for some more specialized hardware you may have to request access separately. The rationale is that we want to ensure that this specialized (usually rather expensive) hardware is used efficiently for the type of applications it was purchased for. Contact your local VSC coordinator to arrange access when required. For the main Tier-1 compute cluster you need to submit a project application (or you should be covered by a project application within your research group).

1.3 How to request an account?

Unlike your institute account, VSC accounts don't use regular fixed passwords but a key pair consisting of a public and a private key, because that is a more secure authentication technique. To apply for a VSC account, you therefore need a public/private key pair.

1.3.1 Create a public/private key pair

A key pair consists of a private and a public key.

1. The private key is stored on the computer(s) you use to access the VSC infrastructure and always stays there.
2. The public key is stored on the VSC systems you want to access, allowing you to prove your identity through the corresponding private key.

How to generate such a key pair depends on your operating system. We describe the generation of key pairs in the client sections for

• Linux,
• Windows and
• macOS (formerly OS X).

Without a key pair, you won't be able to apply for a VSC account.
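If you use OpenSSH (the default on Linux and macOS, and included with recent Windows versions), generating a key pair comes down to a single command. The sketch below is only an illustration: the key type and file name are arbitrary choices, and the per-OS client pages mentioned above remain the authoritative instructions.

$ # Generate a key pair and protect the private key with a strong passphrase when prompted.
$ ssh-keygen -t rsa -b 4096 -f ~/.ssh/id_rsa_vsc
$ # The private key (~/.ssh/id_rsa_vsc) never leaves your computer; the public key
$ # (~/.ssh/id_rsa_vsc.pub) is the file you upload when requesting your VSC account.
$ cat ~/.ssh/id_rsa_vsc.pub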

Warning: It is clear from the above that it is very important to protect your private key well. Therefore:

• You should choose a strong passphrase to protect your private key.
• You should not share your key pair with other users.
• If you have accounts at multiple supercomputer centers (or on other systems that use SSH), you should seriously consider using a different key pair for each of those accounts. That way, if a key gets compromised, the damage can be contained.
• For added security, you may also consider using a different key pair for each computer you use to access your VSC account. If your computer is stolen, it is then easy to disable access from that computer while you can still access your VSC account from all your other computers. The procedure is explained on a separate web page "access from multiple machines".

Your VSC account is currently managed through your institute account.

1.3.2 Applying for the account

Once you have a valid public/private key pair, you can submit an account request. The overview below guides you to the appropriate procedure for your institution.

Users from the KU Leuven and UHasselt association

UHasselt has an agreement with KU Leuven to run a shared infrastructure. Therefore the procedure is the same for both institutions.

Who? Access is available for faculty, students (under faculty supervision), and researchers of KU Leuven, UHasselt and their associations.

How?

• Researchers with a regular personnel account (u-number) can use the generic procedure.


• If you are in one of the higher education institutions associated with KU Leuven, the generic procedure may not work. In that case, please e-mail [email protected] to get an account. You will have to provide a public ssh key generated as described above.
• Lecturers of KU Leuven and UHasselt that need HPC access for giving their courses: the procedure requires action both from the lecturers and from the students. Lecturers should follow the specific procedure for lecturers, while the students should simply apply for the account through the generic procedure.

Users of Ghent University Association

All information about the access policy is available in English at the UGent HPC web pages.

• Researchers can use the generic procedure.
• Master's students can also use the infrastructure for their master's thesis work. The promotor of the thesis should first send a motivation to [email protected]; the student then follows the generic procedure (using their UGent student id) to request the account.

Users of the Antwerp University Association (AUHA)

Who? Access is available for faculty, students (master's projects under faculty supervision), and researchers of the AUHA.

How?

• Researchers can use the generic procedure.
• Master's students can also use the infrastructure for their master's thesis work. The promotor of the thesis should first send a motivation to [email protected]; the student then follows the generic procedure (using their UAntwerpen student id) to request the account.

Users of Brussels University Association

All information about the access policy is available on the VUB HPC documentation website.

• Researchers can use the generic procedure.
• Master's students can also use the infrastructure for their master's thesis work. The promotor of the thesis should first send a motivation to [email protected]; the student then follows the generic procedure to request the account.

Everyone else

Who? Check that you are eligible to use the VSC infrastructure.

How? Ask your VSC contact for help. If you don't have a VSC contact yet, please get in touch with us.


Generic procedure for academic researchers

For most researchers from the Flemish universities, the procedure has been fully automated and works by using your institute account to request a VSC account. Check below for exceptions or if the generic procedure does not work.

1. Open the VSC account page.
2. Select your "home" institution from the drop-down menu and click the "confirm" button.
3. Log in using your institution login and password.
4. You will be asked to upload the public key you created earlier.
5. You will get an e-mail to confirm your application; click the included link to do so.
6. After the account has been approved by the VSC, your account will be created and you will get a confirmation e-mail.

Warning: Allow for at least half an hour for your account to be properly created after receiving the confirmation email!

Note: If you can't connect to the VSC account page, the problem may be caused by a browser extension (in particular some security-related extensions), so try again with browser extensions disabled.

1.4 Next steps

Register for an HPC Introduction course. These are organized at all universities on a regular basis.

Note: For KU Leuven users: if there is no course announced, please register on our training waiting list and we will organize a new session as soon as a few people are interested.

Information on our training program and the schedule is available on the VSC website.

1.5 Additional information

Before you apply for a VSC account, it is useful to first check whether the infrastructure is suitable for your application. Windows or macOS programs, for instance, cannot run on our infrastructure, as we use the Linux operating system on the clusters. The infrastructure also should not be used to run applications for which the compute power of a good laptop is sufficient. The pages on the Tier-1 hardware and Tier-2 hardware in this part of the website give a high-level description of our infrastructure. You can find more detailed information in the user documentation on the user portal. When in doubt, you can also contact your local support team. This does not require a VSC account.

Your account also includes two "blocks" of disk space: your home directory and data directory. Both are accessible from all VSC clusters. When you log in to a particular cluster, you will also be assigned one or more blocks of temporary disk space, called scratch directories. Which directory should be used for which type of data is explained in the page "Where can I store what kind of data?".

Your VSC account does not give you access to all available software. You can use all free software and a number of compilers and other development tools. For most commercial software, you must first prove that you have a valid license, or the person who has paid for the license on the cluster must allow you to use it. For this you can contact your local support team.


CHAPTER 2

Access and data transfer

Before you can really start using one of the clusters, there are several things you need to do or know:

2.1 Logging in to a cluster

You log on to a cluster by connecting with an SSH client to one of its login nodes. This will give you a command line. The software you'll need on your client system depends on its operating system:

2.1.1 Windows client

Getting ready to request an account

Before requesting an account, you need to generate a pair of SSH keys. One popular way to do this on Windows is using the freely available PuTTY client which you can then also use to log on to the clusters, see the instructions for generating a key pair with PuTTY. Another popular way is using the (also freely available) MobaXterm client, see the instructions for generating a key pair with MobaXterm.

Connecting to the cluster

Text-mode session using an SSH client

PuTTY is a simple-to-use and freely available GUI SSH client for Windows that is easy to set up. Pageant can be used to manage active keys for PuTTY, WinSCP, FileZilla as well as the NX client for Windows, so that you don't need to enter the passphrase all the time. Pageant is part of the PuTTY distribution. To establish network communication between your local machine and a compute node of a cluster, you have to create an SSH tunnel using PuTTY. This is also useful to run client software on your Windows machine, e.g., ParaView or Jupyter notebooks that run on a compute node.


Transfer data using Secure FTP (SFTP) clients

Two GUI SFTP clients for Windows are recommended:

• FileZilla
• WinSCP

Display graphical programs

X server

You can install an X server. X is the protocol that is used by most Linux applications to display graphics on a local or remote screen. Alternatively, you can use MobaXterm.

NX client

On the KU Leuven/UHasselt clusters it is also possible to use the NX Client to log on to the machine and run graphical programs. Instead of an X server, another piece of client software is required.

VNC

The KU Leuven/UHasselt, UAntwerp, and VUB clusters also offer support for visualization software through Virtual Network Computing (VNC). VNC renders images on the cluster and transfers the resulting images to your client device. VNC clients are available for Windows, macOS, Linux, Android and iOS.

• On the UAntwerp clusters, TurboVNC is supported on all regular login nodes (without OpenGL support) and on the visualization node of Leibniz (with OpenGL support through VirtualGL). See the page "Remote visualization UAntwerp" for instructions.
• On the VUB clusters, TigerVNC is supported on all nodes. See our documentation on running graphical applications for instructions.

Alternatives

MobaXterm is a free and easy to use SSH client for Windows that has text-mode, a graphical file browser, an X server, an SSH agent, and more, all in one. No installation is required when using the Portable edition. See the detailed instructions on how to set up MobaXterm.

Recent versions of Windows come with OpenSSH installed, so you can use it from PowerShell or the Command Prompt as you would in the terminal on Linux systems, and all pages about SSH and data transfer from the Linux client pages apply.

The Windows Subsystem for Linux can be an alternative if you are using Windows 10 build 1607 or later. The available Linux distributions have SSH clients, so you can refer to all pages about SSH and data transfer from the Linux client pages as well.
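As an illustration of the built-in OpenSSH client, the commands below show what a connection and a file transfer might look like from PowerShell. This is only a sketch: <vsc-account> and <vsc-loginnode> are placeholders for your own VSC account and your institution's login node, and the file name is arbitrary.

> # Log in to a login node of your institution's cluster.
> ssh <vsc-account>@<vsc-loginnode>
> # Copy a local file to your home directory on the cluster (note the trailing colon).
> scp results.csv <vsc-account>@<vsc-loginnode>: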


Programming tools

Warning: Although it is convenient to develop software on your local machine, you should bear in mind that the hardware architecture is likely to differ substantially from the VSC HPC hardware. Therefore it is recommended that performance optimizations are done on the target system.

Windows Subsystem for Linux (WSL/WSL2)

If you're running Windows 10 build 1607 (Anniversary Edition) or later, you may consider running the "Windows Subsystem for Linux", which will give you an Ubuntu-like environment on Windows and allow you to install some Ubuntu packages. In build 1607 this is still considered experimental technology and we offer no support for it.

Microsoft Visual Studio

Microsoft Visual Studio can also be used to develop OpenMP or MPI programs. If you do not use any Microsoft-specific libraries but stick to plain C or C++, the programs can be recompiled on the VSC clusters. Microsoft is slow in implementing new standards though: in Visual Studio 2015, OpenMP support is still stuck at version 2.0 of the standard. An alternative is to get a license for the Intel compilers, which plug into Visual Studio and give you the best of both worlds: the power of a full-blown IDE and compilers that support the latest technologies in the HPC world on Windows.

Eclipse

Eclipse is a popular multi-platform Integrated Development Environment (IDE) very well suited for code development on clusters.

• Read our Eclipse introduction to find out why you should consider using Eclipse if you develop code and how to get it.
• You can use Eclipse on the desktop as a remote editor for the cluster.
• You can combine the remote editor feature with version control from Eclipse, but some care is needed, and here's how to do it.

On Windows, Eclipse relies by default on the Cygwin toolchain for its compilers and other utilities, so you need to install that too.

Version control

Information on tools for version control (git and subversion) is available on the Version control systems introduction page.

2.1.2 macOS client

Since all VSC clusters use Linux as their main operating system, you will need to get acquainted with using the command-line interface and using the Terminal. To open a Terminal window in macOS (formerly OS X), choose Applications > Utilities > Terminal in the Finder. If you don't have any experience with using the Terminal, we suggest you read the basic Linux usage section first (which also applies to macOS).


Getting ready to request an account

Before requesting an account, you need to generate a pair of ssh keys. One popular way to do this on macOS is using the OpenSSH client included with macOS which you can then also use to log on to the clusters.

Connecting to the machine

Text-mode session using an SSH client

To get terminal-based access to a remote system, you can use

• the OpenSSH ssh command, or
• the JellyfiSSH GUI client.

Transfer data using Secure FTP (SFTP)

Data can be transferred using

• Secure FTP (SFTP) with the OpenSSH sftp and scp commands, or
• GUI clients such as Cyberduck or FileZilla.

Display graphical programs

X server

Linux programs use the X protocol to display graphics on local or remote screens. To use your Mac as a remote screen, you need to install an X server. XQuartz is one that is freely available. Once the X server is up and running, you can simply open a terminal window and connect to the cluster using the command-line SSH client in the same way as you would on Linux.

NX client

On the KU Leuven/UHasselt clusters it is possible to use the NX Client to log on to the machine and run graphical programs. Instead of an X-server, another piece of client software is required.

VNC

The KU Leuven/UHasselt, UAntwerp, and VUB clusters also offer support for visualization software through Virtual Network Computing (VNC). VNC renders images on the cluster and transfers the resulting images to your client device. VNC clients are available for Windows, macOS, Linux, Android and iOS.

• On the UAntwerp clusters, TurboVNC is supported on all regular login nodes (without OpenGL support) and on the visualization node of Leibniz (with OpenGL support through VirtualGL). See the page "Remote visualization UAntwerp" for instructions.
• On the VUB clusters, TigerVNC is supported on all nodes. See our documentation on running graphical applications for instructions.


Software development

Eclipse

Eclipse is a popular multi-platform Integrated Development Environment (IDE) very well suited for code development on clusters.

• Read our Eclipse introduction to find out why you should consider using Eclipse if you develop code and how to get it. To get the full functionality of the Parallel Tools Platform and Fortran support on macOS, you need to install some additional software and start Eclipse in a special way, as we explain here.
• You can use Eclipse on the desktop as a remote editor for the cluster.
• You can combine the remote editor feature with version control from Eclipse, but some care is needed, and here's how to do it.

Version control

Most popular version control systems, including Subversion and git, are supported on macOS. See our introduction to version control systems.

2.1.3 Linux client

Since all VSC clusters use Linux as their main operating system, you will need to get acquainted with using the command-line interface and the terminal. To open a terminal in Linux when using KDE, choose Applications > System > Terminal > Konsole. When using GNOME, choose Applications > Accessories > Terminal. If you don't have any experience with using the command-line interface in Linux, we suggest you read the basic Linux usage section first.

Getting ready to request an account

Before requesting an account, you need to generate a pair of ssh keys. One popular way to do this on Linux is using the freely available OpenSSH client which you can then also use to log on to the clusters.

Connecting to the cluster

Text-mode session using an SSH client

The OpenSSH ssh command can be used to open a connection in a Linux terminal session. It is convenient to use an SSH agent to avoid having to enter your private key's passphrase every time you establish a new connection. The SSH configuration file .ssh/config can be used to define connection properties for nodes you often use; it is a considerable time saver when you work in a terminal. To establish network communication between your local machine and the cluster that is otherwise blocked by firewalls, you have to create an SSH tunnel using OpenSSH.
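As an illustration, a minimal ~/.ssh/config entry and the matching agent workflow could look like the sketch below; the alias, the placeholders <vsc-account> and <vsc-loginnode>, and the key file name are assumptions you should replace with your own values.

# Example entry in ~/.ssh/config
Host vsc
    HostName <vsc-loginnode>
    User <vsc-account>
    IdentityFile ~/.ssh/id_rsa_vsc

$ # Load the key into the SSH agent once, then connect using the short alias.
$ ssh-add ~/.ssh/id_rsa_vsc
$ ssh vsc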

Transfer data using Secure FTP (SFTP)

Data can easily be transferred to and from remote systems using the OpenSSH sftp and scp commands.


Display graphical programs

X server

No extra software is needed on a Linux client system, but you need to use the appropriate options with the ssh command as explained on the page on OpenSSH.
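For example, X forwarding is typically enabled with the -X option (or -Y for trusted forwarding); the sketch below uses placeholder account and hostname, and xterm merely stands in for any graphical program.

$ # Log in with X forwarding enabled, then start a graphical program on the login node.
$ ssh -X <vsc-account>@<vsc-loginnode>
$ xterm &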

NX client

On the KU Leuven/UHasselt clusters it is also possible to use the NX Client to log on to the machine and run graphical programs. This requires additional client software that is currently available for Windows, macOS, Linux, Android and iOS. The advantage over displaying X programs directly on your Linux screen is that you can put your laptop to sleep, disconnect and move to another network without losing your X session. Performance may also be better with many programs over high-latency networks.

VNC

The KU Leuven/UHasselt, UAntwerp, and VUB clusters also offer support for visualization software through Virtual Network Computing (VNC). VNC renders images on the cluster and transfers the resulting images to your client device. VNC clients are available for Windows, macOS, Linux, Android and iOS.

• On the UAntwerp clusters, TurboVNC is supported on all regular login nodes (without OpenGL support) and on the visualization node of Leibniz (with OpenGL support through VirtualGL). See the page "Remote visualization UAntwerp" for instructions.
• On the VUB clusters, TigerVNC is supported on all nodes. See our documentation on running graphical applications for instructions.

Software development

Eclipse

Eclipse is a popular multi-platform Integrated Development Environment (IDE) very well suited for code development on clusters.

• Read our Eclipse introduction to find out why you should consider using Eclipse if you develop code and how to get it.
• You can use Eclipse on the desktop as a remote editor for the cluster.
• You can combine the remote editor feature with version control from Eclipse, but some care is needed, and here's how to do it.

Version control

Linux supports all popular version control systems. See our introduction to version control systems.


2.2 Data storage

Your account also comes with a certain amount of data storage capacity in at least three subdirectories on each cluster. You'll need to familiarise yourself with:

2.2.1 Where can I store what kind of data?

Data on the VSC clusters can be stored in several locations, depending on the size and usage of the data. The following locations are available:

Home directory

• Location available as $VSC_HOME.
• The data stored here should be relatively small and should not generate very intense I/O during jobs. Its main purpose is to store all kinds of configuration files, e.g., .bashrc, or MATLAB and Eclipse configuration files, . . .
• Performance is tuned for the intended load: reading configuration files etc.
• Readable and writable on all VSC sites.
• As a best practice, the permissions on your home directory should allow access only for yourself, i.e., 700. To share data with others, use the data directory.

Data directory

• Location available as $VSC_DATA.
• A bigger 'workspace', for program code, datasets or results that must be stored for a longer period of time.
• There is no performance guarantee; depending on the cluster, performance may not be very high.
• Readable and writable on all VSC sites.

Scratch directories

• Several types exist, available in $VSC_SCRATCH_XXX variables.
• For temporary or transient data; there is typically no backup for these file systems, and 'old' data may be removed automatically.
• Currently, $VSC_SCRATCH_NODE and $VSC_SCRATCH_SITE are defined for space that is available per node or per site on all nodes of the VSC.
• These file systems are not exported to other VSC sites.

Since these directories are not necessarily mounted on the same locations at all sites, you should always (try to) use the environment variables that have been created.

Quota is enabled on the three directories, which means the amount of data you can store there is limited by the operating system, and not just by the capacity of the disk system, to prevent the disk system from filling up accidentally. You can see your current usage and the current limits with the appropriate quota command as explained on the page on managing disk space. The actual disk capacity, shared by all users, can be found on the Available hardware page.

You will only receive a warning when you reach the soft limit of either quota. You will only start losing data when you reach the hard limit. Data loss occurs when you try to save new files: this will not work because you have no space left, and thus you will lose these new files. You will however not be warned when data loss occurs, so keep an eye open for the general quota warnings! The same holds for running jobs that need to write files: when you reach your hard quota, jobs will crash.
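The sketch below illustrates how you would typically refer to these locations from the shell or in a job script, using the environment variables rather than hard-coded paths; the directory and file names are made up for the example.

$ # Print the locations on the cluster you are logged in to.
$ echo $VSC_HOME $VSC_DATA $VSC_SCRATCH
$ # Keep long-lived input data in the data directory...
$ mkdir -p $VSC_DATA/myproject
$ cp input.dat $VSC_DATA/myproject/
$ # ...and use scratch for the heavy I/O while a job runs.
$ cp $VSC_DATA/myproject/input.dat $VSC_SCRATCH/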


Home directory

This directory is where you arrive by default when you log in to the cluster. Your shell refers to it as "~" (tilde), and it is also available via the environment variable $VSC_HOME. The data stored here should be relatively small (e.g., no files or directories larger than a gigabyte, although this is not imposed automatically), and usually used frequently. The typical use is storing configuration files, e.g., by MATLAB, Eclipse, . . . The operating system also creates a few files and folders here to manage your account. Examples are:

.ssh/          This directory contains some files necessary for you to login to the cluster and to
               submit jobs on the cluster. Do not remove them, and do not alter anything if you don't
               know what you're doing!
.profile,      These scripts define some general settings about your sessions.
.bash_profile
.bashrc        This script is executed every time you start a session on the cluster: when you login
               to the cluster and when a job starts. You could edit this file to define variables and
               aliases. However, note that loading modules here is strongly discouraged.
.bash_history  This file contains the commands you typed at your shell prompt, in case you need them
               again.
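As an illustration of the kind of harmless customisation that belongs in .bashrc, consider the sketch below; the alias and variable are arbitrary examples, and, as noted above, module load commands should not go here.

# Example additions to ~/.bashrc (purely illustrative).
alias lt='ls -ltr'   # list files, newest last
export EDITOR=vim    # preferred editor for tools that honour $EDITOR
# Do NOT load modules here; do that in your job scripts or via module collections.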

Data directory

In this directory you can store all other data that you need for longer terms. The environment variable pointing to it is $VSC_DATA. There are no guarantees about the speed you'll achieve on this volume. I/O-intensive programs should not run directly from this volume (and if you're not sure whether your program is I/O-intensive, don't run it from this volume). This directory is also a good location to share subdirectories with other users working on the same research projects.

Scratch space

To enable quick writing from your job, a few extra file systems are available on the work nodes. These extra file systems are called scratch folders, and can be used for storage of temporary and/or transient data (temporary results, anything you just need during your job, or your batch of jobs). You should remove any data from these systems once your processing has finished. There are no guarantees about the time your data will be stored on this system, and we plan to clean these automatically on a regular basis. The maximum allowed age of files on these scratch file systems depends on the type of scratch, and can be anywhere between a day and a few weeks. We don't guarantee that these policies remain forever, and may change them if this seems necessary for the healthy operation of the cluster. Each type of scratch has its own use:

Shared scratch ($VSC_SCRATCH) To allow a job running on multiple nodes (or multiple jobs running on separate nodes) to share data as files, every node of the cluster (including the login nodes) has access to this shared scratch directory. Just like the home and data directories, every user has their own scratch directory. Because this scratch is also available from the login nodes, you can manually copy results to your data directory after your job has ended. Different clusters on the same site may or may not share the scratch space pointed to by $VSC_SCRATCH. This scratch space is provided by a central file server that contains tens or hundreds of disks. Even though it is shared, it is usually very fast, as it is very rare that all nodes do I/O simultaneously. It also implements a parallel file system that allows a job to do parallel file I/O from multiple processes to the same file simultaneously, e.g., through MPI parallel I/O. For most jobs, this is the best scratch system to use.


Site scratch ($VSC_SCRATCH_SITE) A variant of the previous one, which may or may not be the same volume. On clusters that have access to both a cluster-local scratch and a site-wide scratch file system, this variable will point to the site-wide available scratch volume. On other sites it will just point to the same volume as $VSC_SCRATCH.

Node scratch ($VSC_SCRATCH_NODE) Every node has its own scratch space, which is completely separated from the other nodes. On many cluster nodes, this space is provided by a local hard drive or SSD. Every job automatically gets its own temporary directory on this node scratch, available through the environment variable $TMPDIR. $TMPDIR is guaranteed to be unique for each job. Note however that when your job requests multiple cores and these cores happen to be on the same node, this $TMPDIR is shared among those cores! Also, you cannot access this space once your job has ended. And on a supercomputer, a local hard disk may not be faster than a remote file system, which often has tens or hundreds of drives working together to provide disk capacity.

Global scratch ($VSC_SCRATCH_GLOBAL) We may or may not implement a VSC-wide scratch volume in the future, and the environment variable $VSC_SCRATCH_GLOBAL is reserved to point to that scratch volume. Currently it just points to the same volume as $VSC_SCRATCH or $VSC_SCRATCH_SITE.
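The fragment below sketches a common pattern for using node scratch inside a job script: copy the input to $TMPDIR, work there, and copy the results back before the job ends (remember that $TMPDIR disappears afterwards). The file names and the program are placeholders.

# Inside a job script (see the chapter on running jobs).
cp $VSC_DATA/myproject/input.dat $TMPDIR/
cd $TMPDIR
# Run your (hypothetical) application on the local copy.
$VSC_DATA/myproject/my_program input.dat > output.dat
# Copy the results back to permanent storage before the job ends.
cp output.dat $VSC_DATA/myproject/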

2.2.2 How much disk space am I using?

Total disk space used on file systems with quota

On file systems with quota enabled, you can check the amount of disk space that is available for you, and the amount of disk space that is in use by you. On most systems, the myquota command will show you, for the $VSC_HOME, $VSC_DATA and $VSC_SCRATCH file systems, either the percentage of the available disk space you are using, or the absolute amount. Users from Ghent University should check their disk usage using the web application. If quota have been set on the number of files you can create on a file system, those are listed as well. Example:

$ myquota
file system $VSC_DATA using 35G of 75G, 1126k of 10000k files
file system $VSC_HOME using 2401M of 3072M, 40342 of 100k files
file system $VSC_SCRATCH using 5.82G of 100G

Warning: If your file usage approaches the limits, jobs may crash unexpectedly.

Disk space used by individual directories

Warning: The du command will stress the file system, and all file systems are shared, so please use it wisely and sparingly.

The command to check the size of all subdirectories in the current directory is du:

$ du -h
4.0k ./.ssh
0 ./somedata/somesubdir
52.0k ./somedata
56.0k .

This shows you first the aggregated size of all subdirectories, and finally the total size of the current directory "." (this includes files stored in the current directory). The -h option ensures that sizes are displayed in human-readable form (kB, MB, GB); omitting it will show sizes in du's default unit, which on most systems is kilobyte blocks. If the directory contains a deep hierarchy of subdirectories, you may not want to see the information at that depth; you can just ask for a summary of the current directory:

$ du -s
54864 .

If you want to see the size of any file or top-level subdirectory in the current directory, you could use the following command:

$ du -s *
12 a.out
3564 core
4 mpd.hosts
51200 somedata
4 start.sh
4 test

Finally, if you don't want to know the size of the data in your current directory, but in some other directory (e.g., your data directory), you just pass this directory as a parameter:

$ du -h -s $VSC_DATA/input_data/*
50M /data/leuven/300/vsc30001/input_data/somedata

2.3 Transferring data

Before you can do some work, you'll have to transfer the files that you need from your desktop or department to the cluster. At the end of a job, you might want to transfer some files back. The preferred way to do that is to use an SFTP client. This again requires some software on your client system, which depends on its operating system:

2.3.1 Data transfer with FileZilla

FileZilla is an easy-to-use freely available ftp-style program to transfer files to and from your account on the clusters. You can also put FileZilla with your private key on a USB stick to access your files from any internet-connected PC. You can download FileZilla from the FileZilla project page.

Configuration of FileZilla to connect to a login node

Note: Pageant should be running and your private key should be loaded first (more info on our "using Pageant" page).

1. Start FileZilla;
2. Open the Site Manager using the 'File' menu;
3. Create a new site by clicking the New Site button;


4. In the tab marked 'General', enter the following values (all other fields remain blank):
   • Host: fill in the hostname of the VSC login node of your home institution. You can find this information in the overview of available hardware on this site.
   • Servertype: SFTP - SSH File Transfer Protocol
   • Logontype: Normal
   • User: your own VSC user ID, e.g., vsc98765;
5. Optionally, rename this setting to your liking by pressing the 'Rename' button;
6. Press 'Connect' and enter your passphrase when requested.

Note that recent versions of FileZilla have a screen in the settings to manage private keys. The path to the private key must be provided in options (Edit Tab -> options -> connection -> SFTP):


After that, you should be able to connect after being asked for your passphrase. As an alternative, you can choose to use PuTTY's Pageant.

2.3.2 Data transfer using WinSCP

Prerequisite: WinSCP

To transfer files to and from the cluster, we recommend the use of WinSCP, a graphical ftp-style program (but one that uses the SSH protocol to communicate with the cluster rather than the less secure FTP) that is also freely available. WinSCP can be downloaded both as an installation package and as a standalone portable executable. When using the portable version, you can copy WinSCP together with your private key on a USB stick to have access to your files from any internet-connected Windows PC. WinSCP also works well together with the PuTTY suite of applications: it uses the keys generated with the PuTTY key generation program, can launch terminal sessions in PuTTY, and can use SSH keys managed by Pageant.

Transferring your files to and from the VSC clusters

The first time you make the connection, you will be asked to 'Continue connecting and add host key to the cache'; select 'Yes'.

1. When you first install WinSCP, it should open a new session dialog. If that does not happen, start WinSCP and go to the "Session" tab. From there choose "New Session". Fill in the following information:


   1. Fill in the hostname of the VSC login node of your home institution. You can find this information in the overview of available hardware on this site.
   2. Fill in your VSC username.
   3. Double check that the port number is 22.
2. If you are not using Pageant to manage your SSH keys, you have to point WinSCP to the private key file (in PuTTY .ppk format) that should be used. You can do that using the "Advanced" button and then choosing "SSH" > "Authentication" from the list. When using Pageant, you can leave this field blank.


3. If you want to store this data for later use, click the “Save” button and enter a name for the session. Next time you’ll start WinSCP, you’ll get a screen with stored sessions that you can open by selecting them and clicking the “Login” button.

4. Click the “Login” button to start the session that you just created. You’ll be asked for your passphrase if pageant is not running with a valid key loaded. The first time you make the connection, you will be asked to “Continue connecting and add host key to the cache”; select “Yes”.

Some remarks


Two interfaces

WinSCP has two modes for its user interface:

• The "commander mode", where you get a window with two columns, with the local directory in the left column and the host (remote) directory in the right column. You can then transfer files by dragging them from one column to the other.
• The "explorer mode", where you only see the remote directory. You can transfer files by dragging them to and from other folder windows or the desktop.

The default mode is "commander". You can always switch modes by going to the "Options" tab, choosing "Preferences" and selecting the "Environment\Interface" category.

Enable logging

When you experience trouble transferring files using WinSCP, the support team may ask you to enable logging and mail the results.

1. To enable logging:


   1. Go to the "Options" tab and choose "Preferences".
   2. Select the "Logging" category.
   3. Check the box next to "Enable session logging on level" and select the logging level requested by the user support team. Often normal logging will be sufficient.
   4. Enter a name and directory for the log file. The default is %TEMP%\!S.log, which will expand to a name that is system-dependent and depends on the name of your WinSCP session. %TEMP% is a Windows environment variable pointing to a directory for temporary files, which on most systems is well hidden. !S will expand to the name of your session (for a stored session, the name you used there). You can always change this to another directory and/or file name that is easier for you to work with.
2. Now just run WinSCP as you would do without logging.
3. To mail the result if you used the default log file name %TEMP%\!S.log:
   1. Start a new mail in your favourite mail program (it could even be a web mail service).
   2. Click whatever button or menu choice you need to add an attachment.
   3. Many mail programs will now show you a standard Windows dialog window to select the file. In many mail programs, the left top of the window will look like this (a screen capture from a Windows 7 computer):


      Click right of the text in the URL bar in the upper left of the window. The contents will now change to a regular Windows path name and will be selected. Just type %TEMP% and press Enter, and you will see that %TEMP% will expand to the name of the directory with the temporary files. This trick may not work with all mail programs!
   4. Finish the mail text and send the mail to user support.

2.3.3 Data transfer with scp/sftp

Prerequisite: OpenSSH

scp (secure copy) and sftp (secure FTP) are part of the OpenSSH distribution. See the page on generating keys.

Using scp

How to copy a file?

Files can be transferred with the scp command, which behaves like the standard cp shell command for copying files, except that scp can copy to and from remote systems that run an sshd daemon. For example, to copy the (local) file local_file.txt to your home directory on the cluster (where <vsc-loginnode> is a login node and <vsc-account> is your VSC account), use:

$ scp local_file.txt <vsc-account>@<vsc-loginnode>:

Likewise, to copy the remote file remote_file.txt from your home directory on the cluster to your local computer, use:

$ scp <vsc-account>@<vsc-loginnode>:remote_file.txt .

Note: The colon in the remote path is required!

Suppose you want to copy multiple files data_*.txt from the current working directory on your local system to a directory called inputs in your data directory on a VSC system; you can use globbing, just as you would for cp:

$ scp data_*.txt [email protected]:/data/leuven/500/vsc50005/inputs

Warning: Although it might be tempting to use the $VSC_DATA environment variable, this will not work. The variable will be expanded on your local system, where it is not defined, resulting in a copy to a directory inputs in your VSC home directory.


Copying directories

Similar to cp, copying a directory is done using the -r flag, e.g.,

$ scp -r inputs/ [email protected]:/data/leuven/500/vsc50005/

This will copy the directory (and all of its contents) from your local system to your data directory on the VSC remote system.

Using sftp

sftp is the equivalent of the ftp command, but it uses the secure SSH protocol to connect to the clusters. One easy way of starting an sftp session is:

$ sftp <vsc-account>@<vsc-loginnode>

You can now transfer files to and from the remote system. Some useful sftp commands are listed in the table below.

operation                remote  local
change directory         cd      lcd
list directory content   ls      lls
copy file from           get     put
glob copy from           mget    mput
quit                     bye
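A short interactive session might look like the sketch below (the account, login node and file names are placeholders; cd/lcd switch the remote and local directories, put uploads, get downloads):

$ sftp <vsc-account>@<vsc-loginnode>
sftp> cd inputs
sftp> lcd results
sftp> put data_01.txt
sftp> get output.log
sftp> bye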

Links

• scp manual page (external) • sftp manual page (external)

2.4 GUI applications on the clusters

Optionally, if you wish to use programs with a graphical user interface, you'll need an X server on your client system. Again, this depends on the latter's operating system:

• Windows
• Linux
• macOS/OS X

2.5 VPN

Logging in to the login nodes of your institute’s cluster may not work if your computer is not on your institute’s network (e.g., when you work from home). In those cases you will have to set up a VPN (Virtual Private Networking) connection if your institute provides this service.

CHAPTER 3

Software stack

Software installation and maintenance on HPC infrastructure such as the VSC clusters poses a number of challenges not encountered on a workstation or a departmental cluster. For many libraries and programs, multiple versions have to be installed and maintained, as some users require specific versions of those. In turn, those libraries or executables sometimes rely on specific versions of other libraries, further complicating the matter.

The way Linux finds the right executable for a command, and a program loads the right version of a library or a plug-in, is through so-called environment variables. These can, e.g., be set in your shell configuration files (e.g., .bashrc), but this requires a certain level of expertise. Moreover, getting those variables right is tricky and requires knowledge of where all files are on the cluster. Having to manage all this by hand is clearly not an option.

We deal with this on the VSC clusters in the following way. First, we've defined the concept of a toolchain. A toolchain consists of a set of compilers, an MPI library and basic libraries that work well together, and then a number of applications and other libraries compiled with that set of tools and thus often dependent on those. We use toolchains based on the Intel and GNU compilers, and refresh them twice a year, leading to version numbers like 2014a, 2014b or 2015a for the first and second refresh of a given year. Some tools are installed outside a toolchain, e.g., additional versions requested by a small group of users for specific experiments, or tools that only depend on basic system libraries. Second, we use the module system to manage the environment variables and all dependencies and possible conflicts between various programs and libraries, and that is what this page focuses on.

3.1 Using the module system

Many software packages are installed as modules. These packages include compilers, interpreters, mathematical software such as Matlab and SAS, as well as other applications and libraries. This is managed with the module command.

3.1.1 Available modules

To view a list of available software packages, use the command module av. The output will look similar to this:


$ module av
----- /apps/leuven/skylake/2018a/modules/all -----
Autoconf/2.69-GCC-4.8.2
Autoconf/2.69-intel-2018a
Automake/1.14-GCC-4.8.2
Automake/1.14-intel-2018a
BEAST/2.1.2
...
pyTables/2.4.0-intel-2018a-Python-2.7.6
timedrun/1.0.1
worker/1.4.2-foss-2018a
zlib/1.2.8-foss-2018a
zlib/1.2.8-intel-2018a

3.1.2 Module names

In general, the anatomy of a module name is

<name>/<version>-<toolchain>[-<versionsuffix>]

For example, for Boost/1.66.0-intel-2018a-Python-3.6.4, we have

• <name>: Boost, the name of the library,
• <version>: 1.66.0, the version of the Boost library,
• <toolchain>: intel-2018a, the toolchain Boost was built with, and
• <versionsuffix>: Python-3.6.4, the version of Python this Boost version can inter-operate with.

Some packages in the list above include intel-2014a or foss-2014a in their name. These are packages installed with the 2014a versions of the toolchains based on the Intel and GNU compilers respectively. The other packages do not belong to a particular toolchain. The name of the packages also includes a version number (right after the /) and sometimes other packages they need.

3.1.3 Searching modules

Often, when looking for some specific software, you will want to filter the list of available modules, since it tends to be rather large. The module command writes its output to standard error, rather than standard output, which is somewhat confusing when using pipes to filter. The following command would show only the modules that have the string ‘python’ in their name, regardless of the case.

$ module av |& grep -i python

For more comprehensive searches, you can use module spider, e.g.,

$ module spider python

3.1.4 Info on modules

The spider sub-command can also be used to provide information on modules, e.g.,


$ module spider Python/2.7.14-foss-2018a

--------------------------------------------------------------------
  Python: Python/2.7.14-foss-2018a
--------------------------------------------------------------------
    Description:
      Python is a programming language that lets you work more quickly and integrate your systems more effectively.

This module can be loaded directly: module load Python/2.7.14-foss-2018a

More technical information can be obtained using the show sub-command, e.g.,

$ module show Python/2.7.14-foss-2018a

3.1.5 Loading modules

A module is loaded using the command module load with the name of the package, e.g., with the above list of modules,

$ module load BEAST

will load the BEAST/2.1.2 package. For some packages, e.g., zlib in the above list, multiple versions are installed; the module load command will automatically choose the lexicographically last, which is typically, but not always, the most recent version. In the above example,

$ module load zlib

will load the module zlib/1.2.8-intel-2014a. This may not be the module that you want if you're using the GNU compilers. In that case, the user should specify a particular version, e.g.,

$ module load zlib/1.2.8-foss-2014a

Note: Loading modules with explicit versions is considered best practice. It ensures that your scripts will use the expected version of the software, regardless of newly installed software. Failing to do this may jeopardize the reproducibility of your results!

Modules need not be loaded one by one; the two ‘load’ commands can be combined as follows:

$ module load BEAST/2.1.2 zlib/1.2.8-foss-2014a

This will load the two modules and, automatically, the respective toolchains with just one command.

Warning: Do not load modules in your .bashrc, .bash_profile or .profile; you will shoot yourself in the foot at some point. Consider using module collections (the module save and module restore commands described below) as a command line alternative (so not in the shell initialization files either!).


3.1.6 List loaded modules

Obviously, the user needs to keep track of the modules that are currently loaded. After executing the above two load commands, the list of loaded modules will be very similar to:

$ module list
Currently Loaded Modulefiles:
  1) /thinking/2014a
  2) Java/1.7.0_51
  3) icc/2013.5.192
  4) ifort/2013.5.192
  5) impi/4.1.3.045
  6) imkl/11.1.1.106
  7) intel/2014a
  8) beagle-lib/20140304-intel-2014a
  9) BEAST/2.1.2
 10) GCC/4.8.2
 11) OpenMPI/1.6.5-GCC-4.8.2
 12) gompi/2014a
 13) OpenBLAS/0.2.8-gompi-2014a-LAPACK-3.5.0
 14) FFTW/3.3.3-gompi-2014a
 15) ScaLAPACK/2.0.2-gompi-2014a-OpenBLAS-0.2.8-LAPACK-3.5.0
 16) foss/2014a
 17) zlib/1.2.8-foss-2014a

It is important to note at this point that, e.g., icc/2013.5.192 is also listed, although it was not loaded explicitly by the user. This is because BEAST/2.1.2 depends on it, and the system administrator specified that the intel toolchain module that contains this compiler should be loaded whenever the BEAST module is loaded. There are advantages and disadvantages to this, so be aware of automatically loaded modules whenever things go wrong: they may have something to do with it!

3.1.7 Unloading modules

To unload a module, one can use the module unload command. It works consistently with the load command, and reverses the latter’s effect. One can however unload automatically loaded modules manually, to debug some problem.

$ module unload BEAST

Notice that the version was not specified: the module system is sufficiently clever to figure out what the user intends. However, checking the list of currently loaded modules is always a good idea, just to make sure. . .

3.1.8 Purging modules

In order to unload all modules at once, and hence be sure to start with a clean slate, use:

$ module purge

Note: It is a good habit to use this command in PBS scripts, prior to loading the modules specifically needed by the applications in that job script. This ensures that no version conflicts occur if the user loads modules in their .bashrc file.


3.1.9 Getting help

To get a list of all available module commands, type:

$ module help

3.1.10 Collections of modules

Although it is convenient to set up your working environment by loading modules in your .bashrc or .profile file, this is error prone and you will end up shooting yourself in the foot at some point. The module system provides an alternative approach that lets you set up an environment with a single command, offering a viable alternative to polluting your .bashrc.

Define an environment

1. Be sure to start with a clean environment:

$ module purge

2. Load the modules you want in your environment, e.g.,

$ module load matplotlib/2.1.2-intel-2018a-Python-3.6.4 $ module load matlab/R2019a

3. Save your environment, e.g., as data_analysis:

$ module save data_analysis

Use an environment

$ module restore data_analysis

List all your environments

$ module savelist

Remove an environment

$ rm ~/.lmod.d/data_analysis

3.1.11 Specialized software stacks

The list of software available on a particular cluster can be unwieldy, and the information that module av produces overwhelming. Therefore the administrators may have chosen to only show the most relevant packages by default, and not show, e.g., packages that target a different cluster, a particular node type or a less complete toolchain. Those additional packages can then be enabled by loading another module first. E.g., to get access to the modules in the (at the time of writing) incomplete 2019a toolchain on UAntwerpen's leibniz cluster, one should first enter:

$ module load leibniz/2019a-experimental


CHAPTER 4

Running jobs

An HPC cluster is a multi-user system. This implies that your computations run on a part of the cluster that will be temporarily reserved for you by the scheduler.

Warning: Do not run computationally intensive tasks on the login nodes! These nodes are shared among all active users, so putting a heavy load on those nodes will annoy other users.

Although you can work interactively on an HPC system, most computations are performed in batch mode. The workflow is straightforward:

1. Create a job script.
2. Submit it as a job to the scheduler.
3. Wait for the computation to run and finish.

4.1 Job script

A job script is essentially a Bash script, augmented with information for the scheduler. As an example, consider a file hello_world.pbs as below.

 1 #!/usr/bin/env bash
 2
 3 #PBS -l nodes=1:ppn=1
 4 #PBS -l walltime=00:05:00
 5 #PBS -l pmem=1gb
 6
 7 cd $PBS_O_WORKDIR
 8
 9 module purge
10 module load Python/3.7.2-foss-2018a
11
12 python hello_world.py

We discuss this script line by line.

• Line 1 is a she-bang that indicates that this is a Bash script.
• Lines 3-5 inform the scheduler about the resources required by this job.
  – It requires a single node (nodes=1), and a single core (ppn=1) on that node.
  – It will run for at most 5 minutes (walltime=00:05:00).
  – It will use at most 1 GB of RAM (pmem=1gb).
• Line 7 changes the working directory to the directory from which the job was submitted (that will be the value of the $PBS_O_WORKDIR environment variable when the job runs).
• Lines 9 and 10 set up the environment by loading the appropriate modules.
• Line 12 performs the actual computation, i.e., running a Python script.

Warning: When running on KU Leuven/UHasselt and Tier-1 infrastructure, make sure to specify a credit account as part of your job script; if not, your job will not run.

#PBS -A lp_example

For more information, see credit system basics.

Every job script has the same basic structure.

Note: Although you can use any file name extension you want, it is good practice to use .pbs since that allows support staff to easily identify your job script.

More information is available on

• specifying job resources,
• specifying job names, output files and notifications,
• using the credit system (KU Leuven/UHasselt infrastructure and Tier-1 only),
• using the module system.

4.2 Submitting and monitoring a job

Once you have created your job script, and transferred all required input data if necessary, you can submit your job to the scheduler:

$ qsub hello_world.pbs
205814.leibniz

The qsub command returns a job ID, a unique identifier that you can use to manage your job. Only the number, i.e., 205814, is significant. Once submitted, you can monitor the status of your job using the qstat command.


$ qstat
Job ID            Name              User      Time Use  S  Queue
----------------  ----------------  --------  --------  -  -----
205814.leibniz    hello_world.pbs   vsc30140  0         Q  q1h

The status of your job is given in the S column. The most common values are given below.

status  meaning
Q       job is queued, i.e., waiting to be executed
R       job is running
C       job is completed, i.e., finished
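The job ID returned by qsub can also be passed to scheduler commands directly, for example to check on or cancel a single job. A minimal sketch, reusing the job ID from the example above (qdel is the standard Torque command to delete a job):

$ # Show only this job rather than all your jobs.
$ qstat 205814
$ # Cancel the job if you no longer need it.
$ qdel 205814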

More information is available on

4.2.1 Submitting and managing jobs with Torque and Moab

Submitting your job: qsub

Once your job script is finished, you submit it to the scheduling system using the qsub command to place your job in the queue:

$ qsub <jobscript>
205814.leibniz

When qsub successfully queues your job, it responds with a job ID, 205814.leibniz in the example above. This is a unique identifier for your job, and can be used to manage it. In general, the number will suffice for this purpose.

As explained on the pages on resource specification and specifying output files and notifications, there are several options to inform the scheduler about the resources your job requires, or whether you want to be notified of events related to your job. These options can be specified

• at the top of your job script, and/or
• as additional command line options for qsub.

In case both are used, options given on the command line take precedence over those in the job script. For example, suppose the job script has the following directive:

#PBS -l walltime=2:00:00

However, if you specify -l walltime=1:30:00 when submitting it with qsub, the maximum walltime for your job will be 1 hour and 30 minutes.
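For example, the same script could be submitted with such a command line override; this is a minimal sketch reusing the hello_world.pbs example from earlier in this chapter:

$ qsub -l walltime=1:30:00 hello_world.pbs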

Starting interactive jobs

Though our clusters are mainly meant to be used for batch jobs, there are some facilities for interactive work:
• The login nodes can be used for light interactive work. They can typically run the same software as the compute nodes. Some sites also have special interactive nodes for special tasks, e.g., scientific data visualization. See the “VSC hardware” section where each site documents what is available. Examples of work that can be done on the login nodes:
  – running a GUI program that generates the input files for your simulation,
  – a not too long compile,
  – a quick and not very resource intensive visualization.
  We have set limits on the compute time a program can use on the login nodes.
• It is also possible to request one or more compute nodes for interactive work. This is also done through the qsub command. Interactive use of nodes is mostly meant for
  – debugging,
  – large compiles, or
  – larger visualizations on clusters that don’t have dedicated nodes for visualization.
To start an interactive job, use qsub’s -I option. You would typically also add several -l options to specify for how long you need the node and the amount of resources that you need. For instance, to use a node with 20 cores interactively for 2 hours, you can use the following command:

qsub -I -l walltime=2:00:00 -l nodes=1:ppn=20

qsub will block until it gets a node and then you get the command prompt for that node. If the wait is too long however, qsub will return with an error message and you’ll need to repeat the command. If you want to run graphical user interface programs (using X) in your interactive job, you have to add the -X option to the above command. This will set up the forwarding of X traffic to the login node, and ultimately to your terminal if you have set up the connection to the login node properly for X support.

Note:
• Please be reasonable when requesting interactive resources. On some clusters, a short walltime will give you a higher priority, and on most clusters a request for a multi-day interactive session will fail simply because the cluster cannot give you such a node before the time-out of qsub kicks in.
• Please act responsibly; interactive jobs are by definition inefficient: the systems are mostly idling while you type.

Viewing your jobs in the queue

Two commands can be used to show your jobs in the queue:
• qstat shows the queue from the resource manager’s perspective. It doesn’t know about priorities, only about requested resources and the state of your job: still idle and waiting for resources, running, completed, . . .
• showq shows the queue from the scheduler’s perspective, taking priorities and policies into account.

qstat

On the VSC clusters, users will only receive a part of the information that qstat offers. To protect the users’ privacy, output is always restricted to the user’s own jobs. To see your jobs in the queue, enter:

$ qstat

This will give you an overview of all jobs including their status, possible values are listed in the table below.


status  meaning
Q       job is queued, i.e., waiting to be executed
S       job is starting, i.e., its prologue is executed
R       job is running
E       job is exiting, i.e., its epilogue is executed
C       job is completed, i.e., finished
H       job has a hold in place

Several command line options can be specified to modify the output of qstat:
• -i will show you the resources the jobs require.
• -n or -n1 will also show you the nodes allocated to each running job.

showq

The showq command will show you information about the queue from the scheduler’s perspective. Jobs are subdivided in three categories:
• Active jobs are actually running, started or terminated.
• Eligible jobs are queued and considered eligible for scheduling.
• Blocked jobs are ineligible to run or to be queued for scheduling.
The showq command will split its output according to the three major categories. Active jobs are sorted according to their expected end time, while eligible jobs are sorted according to their current priority.
There are multiple reasons why a job might be blocked, indicated by the state value below:
Idle Job violates a fairness policy, i.e., you have used too many resources lately. Use diagnose -q for more information.
UserHold A user hold is in place. This may be caused by job dependencies.
SystemHold An administrative or system hold is in place. The job will not start until that hold is released.
BatchHold A scheduler batch hold is in place, used when the job cannot be run because
• the requested resources are not available in the system, or
• the resource manager has repeatedly failed in attempts to start the job. This typically indicates a problem with some nodes of the cluster, so you may want to contact user support.
Deferred A scheduler defer hold is in place (a temporary hold used when a job has been unable to start after a specified number of attempts; this hold is automatically removed after a short period of time).
NotQueued Job is in the resource manager state NQ (indicating that the job’s controlling scheduling daemon is unavailable).
If your job is blocked, you may want to run the checkjob command to find out why.
There are some useful options for showq:
• -r will show you the running jobs only, but will also give more information about these jobs, including an estimate of how efficiently they are using the CPU.
• -i will give you more information about your eligible jobs.
• -p will only show jobs running in the specified partition.


A note on queues

Both qstat and showq can show you the name of the queue (qstat) or class (showq), which in most cases is actually the same thing. All VSC clusters have multiple queues that are used to define policies. E.g., users may be allowed to have many short jobs running simultaneously, but may be limited to a few multi-day jobs to avoid long-time monopolization of a cluster by a single user. This would typically be implemented by having separate queues with specific policies for short and long jobs. When you submit a job, qsub will automatically put it in a particular queue based on the requested resources.

Warning: The qsub command does allow you to specify the queue to use, but unless explicitly instructed to do so by user support, we strongly advise against using this option. Putting the job in the wrong queue may result in your job being refused by the resource manager, and we may also choose to change the available queues on a system to implement new policies.

Getting detailed information about a job

qstat

To get detailed information on a single job, add the job ID as argument and use the -f or -f1 option:

$ qstat -f 205814

The -n or -n1 option will also show you the nodes allocated to each running job, in addition to the regular output.

checkjob

The checkjob command also provides details about a job, but from the perspective of the scheduler, so that you get different information. The command below will produce information about the job with jobid 323323:

$ checkjob 323323

Adding the -v option (for verbose) gives you even more information:

$ checkjob -v 323323

For a running job, checkjob will give you an overview of the allocated resources and the wall time consumed so far. For blocked jobs, the end of the output typically contains clues about why a job is blocked.

Deleting a queued or running job: qdel

This is easily done with qdel, e.g., the following command will delete the job with ID 323323:

$ qdel 323323

If the job is already running, the processes will be killed and the resources will be returned to the scheduler for another job.


Getting a start time estimate for your job: showstart

This is a very simple tool that will tell you, based on the current status of the cluster, when your job is scheduled to start:

$ showstart 20030021
job 20030021 requires 896 procs for 1:00:00
Earliest start in       5:20:52:52 on Tue Mar 24 07:36:36
Earliest completion in  5:21:52:52 on Tue Mar 24 08:36:36
Best Partition: DEFAULT

Note: This is only an estimate, based on the jobs that are currently running or queued and the walltime that users gave for these jobs.
• Jobs may always end sooner than requested, so your job may start sooner.
• On the other hand, jobs with a higher priority may also enter the queue and delay the start of your job.

Checking free resources for a short job: showbf

When the scheduler performs its task, there are bound to be some gaps between jobs on a node. These gaps can be backfilled with small jobs. To get an overview of these gaps, you can execute the command showbf:

$ showbf
backfill window (user: 'vsc30001' group: 'vsc30001' partition: ALL) Wed Mar 18 10:31:02
323 procs available for    21:04:59
136 procs available for 13:19:28:58

To check whether a job can run in a specific partition, add the -p option.

Note: There is however no guarantee that if you submit a job that would fit in the available resources, it will also run immediately. Another user might be doing the same thing at the same time, or you may simply be blocked from running more jobs because you already have too many jobs running or have made heavy use of the cluster recently.

4.3 Job output

By default, the output of your job is saved to two files:
• <job name>.o<job ID>: this file contains all text written to standard output, as well as some information about your job.
• <job name>.e<job ID>: this file contains all text written to standard error, if any. If your job fails, or doesn’t produce the expected output, this is the first place to look.
For instance, for the running example, the output file would be hello_world.pbs.o205814 and contains

1  ===== start of prologue =====
2  Date          : Mon Aug 5 14:50:28 CEST 2019
3  Job ID        : 205814
4  Job Name      : hello_world.pbs
5  User ID       : vsc30140
6  Group ID      : vsc30140
7  Queue Name    : q1h
8  Resource List : walltime=00:05:00,nodes=1:ppn=1,neednodes=1:ppn=1
9  ===== end of prologue ======
10
11 hello world!
12
13 ===== start of epilogue =====
14 Date            : Mon Aug 5 14:50:29 CEST 2019
15 Session ID      : 21768
16 Resources Used  : cput=00:00:00,vmem=0kb,walltime=00:00:02,mem=0kb,energy_used=0
17 Allocated Nodes : r3c08cn1.leibniz
18 Job Exit Code   : 0
19 ===== end of epilogue ======

Lines 1 through 10 are written by the prologue, i.e., the administrative script that runs before your job script. Similarly, lines 12 through 19 are written by the epilogue, i.e., the administrative script that runs after your job script. Line 11 is the actual output of your job script.

Note: The format of the output file differs slightly from cluster to cluster, although the overall structure is the same.

4.4 Troubleshooting

4.4.1 Why doesn’t my job start?

Jobs are submitted to a queue system, which is monitored by a scheduler that determines when a job can be executed. The latter depends on two factors:
1. the priority assigned to the job by the scheduler, and the priorities of the other jobs already in the queue, and
2. the availability of the resources required to run the job.
The priority of a job is calculated using a formula that takes into account a number of criteria:
1. the user’s credentials (at the moment, all users are equal);
2. fair share: this takes into account the walltime the user has used over the last seven days. The more used, the lower the resulting priority;
3. time queued: the longer a job spends in the queue, the higher its priority becomes, so that it will run eventually;
4. requested resources: larger jobs get a higher priority.
These factors are used to compute a weighted sum at each iteration of the scheduler to determine a job’s priority. Due to the time queued and fair share, this is not static, but evolves over time while the job is in the queue. Different clusters use different policies as some clusters are optimized for a particular type of job.
To get an idea when your job might start, you could try Moab’s showstart command.
Also, don’t try to outsmart the scheduler by explicitly specifying nodes that seem empty when you submit your job. The scheduler may be reserving these nodes for a job that requires multiple nodes, so your job will likely spend even more time in the queue, since the scheduler will not launch your job on another node which may be available sooner.


Remember that the cluster is not intended as a replacement for a decent desktop PC. Short, sequential jobs may spend quite some time in the queue, but this type of calculation is atypical from an HPC perspective. If you have large batches of (even relatively short) sequential jobs, you can pack them into longer sequential or even parallel jobs so that they run sooner. User support can help you with that, or see the page How can I run many similar computations conveniently?.

4.4.2 My jobs seem to run, but I don’t see any output or errors?

You ran out of time

It is possible the job exceeded the walltime that was specified as part of the required resources, or the default value otherwise. If this is the case, the resource manager will terminate your job, and the job’s output file will contain a line similar to:

=>> PBS: job killed: walltime exceeded limit

Try to submit your job specifying a larger walltime.

You ran out of disk space

You may have exceeded the disk quota for your home directory, i.e., the total file size for your home directory is just too large. When a job runs, it needs to store temporary output and error files in your home directory. When it fails to do so, the program will crash, and you won’t get feedback, since that feedback would be in the error file that can’t be written. See the FAQs listed below to check the amount of disk space you are currently using, and for a few hints on what data to store where.
Your home directory may unexpectedly fill up in two ways:
1. a running program produces large amounts of output or errors;
2. a program crashes and produces a core dump.

Note: A single job that produces output or a core dump exceeding the file system quota will most likely cause all your queued jobs to fail as well.

Large amounts of output or errors

To deal with the first issue, simply redirect the standard output of the command to a file that is in your data or scratch directory, or, if you don’t need that output anyway, redirect it to /dev/null. A few examples that can be used in your job scripts that execute, e.g., my-prog, are given below.
To send standard output to a file, you can use:

my-prog > $VSC_DATA/my-large-output.txt

If you want to redirect both standard output and standard error, use:

my-prog > $VSC_DATA/my-large-output.txt \
        2> $VSC_DATA/my-large-error.txt


To redirect both standard output and standard error to the same file, use:

my-prog &> $VSC_DATA/my-large-output-error.txt

If you don’t care for the standard output, simply write:

my-prog > /dev/null

Core dump

When a program crashes, a core file is generated. This can be used to try and analyze the cause of the crash. However, if you don’t need core dumps for post-mortem analysis, simply add the following line to your .bashrc file:

ulimit -c 0

This can be done more selectively by adding this line to your job script prior to invoking your program. You can find all the core dumps in your home directory using:

$ find $VSC_HOME -name "core.*"

They can be removed (make sure that only unwanted core files are removed by checking with the command above) using:

$ find $VSC_HOME -name "core.*" -exec rm {} +

You ran out of memory (RAM)

The resource manager monitors the memory usage of your application, and will automatically terminate your job when that memory exceeds a limit. This limit is either the value specified in the resource request using pmem or pvmem, or the default value. You may find an indication that this is the case by looking at the job’s output file. The epilogue information lists the resources used by the job, including memory.

Resources Used : cput=00:00:00,vmem=110357kb,walltime=00:34:02,mem=984584kb

If the value of mem is close to the limit, this may indicate that the application used too much memory. The used resources are just a rough indication, and the reported value can be lower than the actual value if the application’s memory usage increased rapidly. Hence it is prudent to monitor the memory consumption of your job in more detail. You can try to resubmit your job specifying more memory per core.
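For instance, a minimal sketch of a resource request with a larger per-process memory limit could look as follows; the 4gb value is purely illustrative, so adjust it to what your application actually needs:

#PBS -l nodes=1:ppn=1
#PBS -l pmem=4gb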

4.5 Advanced topics

• monitoring memory and cpu, which helps to find the right parameters to improve your specification of the job requirements.
• worker framework: to manage lots of small jobs on a cluster. The cluster scheduler isn’t meant to deal with tons of small jobs. Those create a lot of overhead, so it is better to bundle those jobs in larger sets.


• The checkpointing framework can be used to run programs that take longer than the maximum time allowed by the queue. It can break a long job into shorter jobs, saving the state at the end to automatically start the next job from the point where the previous job was interrupted.


CHAPTER 5

Software development

5.1 Programming paradigms

5.1.1 MPI distributed programming

Purpose

MPI (Message Passing Interface) is a language-independent communications protocol used to program parallel computers. Both point-to-point and collective communication are supported. MPI “is a message-passing application programmer interface, together with protocol and semantic specifications for how its features must behave in any implementation.” MPI’s goals are high performance, scalability, and portability. MPI remains the dominant model used in high-performance computing today. The current version of the MPI standard is 4.0, but hardly any implementations currently have support for the new features. All recent implementations support the full 3.1 standard though.

Some background information

MPI-1.0 (1994) and its updates MPI-1.1 (1995), MPI-1.2 (1997) and MPI-1.3 (1998) concentrate on point-to-point communication (send/receive) and global operations in a static process topology. Major additions in MPI-2.0 (1997) and its updates MPI-2.1 (2008) and MPI-2.2 (2009) are one-sided communication (get/put), dynamic process management and a model for parallel I/O. MPI-3.0 (2012) adds non-blocking collectives, a major update of the one-sided communication model and neighborhood collectives on graph topologies. An update of the specification, MPI-3.1, was released in 2015. MPI-4.0 was formally approved in June 2021.
The two dominant Open Source implementations are Open MPI and MPICH. The latter has been through a couple of name changes: it was originally conceived in the early ’90s as MPICH, then the complete rewrite was renamed to MPICH2, but as this name caused confusion when the MPI standard evolved into MPI 3.x, the name was changed back to MPICH and the version number bumped to 3.0. MVAPICH, developed at Ohio State University, is an offspring of MPICH further optimized for InfiniBand and some other high-performance interconnect technologies. Most other MPI implementations are derived from one of these implementations.


At the VSC we offer both implementations: Open MPI is offered with the GNU compilers in the “FOSS toolchain”, while the Intel MPI used in the “Intel toolchain” is derived from the MPICH code base.

Prerequisites

You have a program that uses an MPI library, either developed by you, or by others. In the latter case, the program’s documentation should mention the MPI library it was developed with.

Implementations

On VSC clusters, several MPI implementations are installed. On all newer machines we provide two MPI implementations, both implementing the MPI-3.1 specification:
1. Intel MPI in the Intel toolchain
2. Open MPI in the FOSS toolchain
When developing your own software, this is the preferred order to select an implementation. The performance should be very similar; however, more development tools are available for Intel MPI (e.g., “Intel Trace Analyzer & Collector” for performance monitoring).
Several other implementations may be installed, e.g., MVAPICH, but we assume you know what you’re doing if you choose to use them.
We also assume you are already familiar with the job submission procedure. If not, check the “Running jobs” section first.

Compiling and running

See the documentation about the Toolchains.

Debugging

For debugging, we recommend the Arm DDT debugger (formerly Allinea DDT, module allinea-ddt). The debugger and the profiler Arm MAP (formerly Allinea MAP) are now bundled into ArmForge, which is available as a module on KU Leuven systems. Video tutorials are available on the Arm website: ARM-DDT video (KU Leuven only). When using the Intel toolchain, “Intel Trace Analyzer & Collector” (ITAC) may also prove useful.

Profiling

To profile MPI applications, one may use Arm MAP (formerly Allinea MAP) or Scalasca (KU Leuven only).

Further information

• Intel MPI web site
  – Intel MPI Documentation (latest version)
• Open MPI web site
  – Open MPI Documentation
• SGI MPT, now HPE Performance Software MPI
  – HPE MPT Documentation
• MPI forum, where you can also find the standard specifications
  – MPI Standard documents
• See also the pages in the tutorials section, e.g., for books and online web tutorials.

5.1.2 OpenMP for shared memory programming

Purpose

OpenMP (Open Multi-Processing) is an API that supports multi-platform shared memory multiprocessing programming in C, C++, and Fortran, on most processor architectures and operating systems. It consists of a set of compiler directives, library routines, and environment variables that influence run-time behavior.
OpenMP uses a portable, scalable model that gives programmers a simple and flexible interface for developing parallel applications for platforms ranging from the standard desktop computer to the supercomputer.
The current version of the OpenMP specification is 5.1, released in November 2020. However, not all compilers fully support this standard yet. The previous specifications were OpenMP 5.0 (November 2018), OpenMP 4.5 (November 2015) and OpenMP 4.0 (July 2013).

Prerequisites

You should have a program that uses the OpenMP API.

Implementations

On the VSC clusters, the following compilers support OpenMP:
• Intel compilers in the Intel toolchain: the Intel compiler version 18.0 (intel/2018a and intel/2018b toolchains) offers almost complete OpenMP 4.5 support.
• GCC compilers in the FOSS toolchain: GCC 6.x (foss/2018a) offers full OpenMP 4.5 support in C and C++, including offloading to some variants of the Xeon Phi and to AMD HSAIL and some support for OpenACC on NVIDIA. For Fortran, OpenMP 4.0 is supported.
For an overview of compiler (version) support for the various OpenMP specifications, see the OpenMP compilers and tools page.

Note: The GCC OpenMP runtime is for most applications inferior to the Intel implementation.

Compiling OpenMP code

See the instructions on the page about toolchains for compiling OpenMP code with the Intel and GCC compilers.

Note: It is in fact possible to link OpenMP object code compiled with GCC and the Intel compiler on the condition that the Intel OpenMP libraries and run-time is used (e.g., by linking using icc with the -qopenmp option), but the Intel manual is not clear which versions of gcc and icc work together well. This is only for specialists but may be useful if you only have access to object files and not to the full source code.


Running OpenMP programs

We assume you are already familiar with the job submission procedure. If not, check the Running jobs section first.
Since OpenMP is intended for use in a shared memory context, when submitting a job to the queue system, remember to request a single node and as many processors as you need parallel threads (e.g., -l nodes=1:ppn=4). The latter should not exceed the number of cores on the machine the job runs on. For relevant hardware information, please consult the list of available hardware.
You may have to set the number of cores that the program should use by hand, e.g., when you don’t use all cores on a node, because the OpenMP runtime detects the number of cores available on the node and does not respect the number of cores assigned to the job. Depending on the program, this may be done through a command line option to the executable, a value in the input file or the environment variable OMP_NUM_THREADS.

Warning: Failing to set this value may result in threads competing with each other for resources such as cache and access to the CPU and thus (much) lower performance.
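As a minimal sketch, a job script for a 4-thread OpenMP run could look like the following; the resource values and the executable name my_openmp_program are illustrative only:

#!/usr/bin/env bash
#PBS -l nodes=1:ppn=4
#PBS -l walltime=01:00:00

cd $PBS_O_WORKDIR
# load the modules your program needs here
# match the number of threads to the number of cores requested above
export OMP_NUM_THREADS=4
./my_openmp_program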

Further information

• OpenMP contains the specifications and some documentation. It is the web site of the OpenMP Architecture Review Board where the standard is discussed.
• See also the pages in the tutorials section and online tutorials. The tutorial at the site of Lawrence Livermore National Laboratory, LLNL OpenMP tutorial (LLNL), is highly recommended.

5.1.3 Hybrid MPI/OpenMP programs

MPI and OpenMP both have their advantages and disadvantages.
MPI can be used on distributed memory clusters and can scale to thousands of nodes. However, it was designed in the days that clusters had nodes with only one or two cores. Nowadays CPUs often have more than ten cores and sometimes support multiple hardware threads (or logical cores) per physical core (and in fact may need multiple threads to run at full performance). At the same time, the amount of memory per hardware thread is not increasing and is in fact quite low on several architectures that rely on a large number of slower cores or hardware threads to obtain a high performance within a reasonable power budget. Starting one MPI process per hardware thread is then a waste of resources as each process needs its communication buffers, OS resources, etc. Managing the hundreds of thousands of MPI processes that we are nowadays seeing on the biggest clusters is very hard.
OpenMP on the other hand is limited to shared memory parallelism, typically within a node of a cluster. Moreover, many OpenMP programs don’t scale past some tens of threads, partly because of thread overhead in the OS implementation and partly because of overhead in the OpenMP run-time.
Hybrid programs try to combine the advantages of both to deal with the disadvantages. Hybrid programs use a limited number of MPI processes (“MPI ranks”) per node and use OpenMP threads to further exploit the parallelism within the node. An increasing number of applications is designed or re-engineered in this way. The optimum number of MPI processes (and hence OpenMP threads per process) depends on the code, the cluster architecture and the problem that is being solved, but often one or, on newer CPUs such as the Intel Haswell, two MPI processes per socket (so two to four for a typical two-socket node) is close to optimal. Compiling and starting such applications requires some care, as we explain on this page.


Preparing your hybrid application to run

To compile and link your hybrid application, you basically have to combine the instructions for MPI and OpenMP programs: use mpicc -fopenmp for the GNU compilers and mpiicc -qopenmp for the Intel compilers (mpiicc -openmp for older versions), or the corresponding MPI Fortran compiler wrappers for Fortran programs.
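For example, assuming a hypothetical source file hybrid_mpi.c, the compile/link step could look like:

# FOSS toolchain (GCC + Open MPI)
mpicc -fopenmp -O2 -o hybrid_mpi hybrid_mpi.c
# Intel toolchain (Intel compilers + Intel MPI)
mpiicc -qopenmp -O2 -o hybrid_mpi hybrid_mpi.c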

Running hybrid programs on the VSC clusters

When running a hybrid MPI/OpenMP program, fewer MPI processes have to be started than there are logical cores available to the application, as every process uses multiple cores through OpenMP parallelism. Yet when requesting logical cores per node from the scheduler, one still has to request the total number of cores needed per node. Hence the PBS property “ppn” should not be read as “processes per node” but rather as “logical cores per node” or “processing units per node”. Instead we have to tell the MPI launcher (mpirun for most applications) to launch fewer processes than there are logical cores on a node, and tell each MPI process to use the correct number of OpenMP threads.
For optimal performance, the threads of one MPI process should be put together as close as possible in the logical core hierarchy implied by the cache and core topology of a given node. E.g., on a dual socket node it may make a lot of sense to run 2 MPI processes, with each MPI process using all cores on a single socket. In other applications, it might be better to run only one MPI process per node, or multiple MPI processes per socket. In more technical words, each MPI process runs in its MPI domain consisting of a number of logical cores, and we want these domains to be non-overlapping and fixed in time during the life of the MPI job, and the logical cores in a domain to be “close” to each other. This optimises the use of the memory hierarchy (cache and RAM).
OpenMP has several environment variables that can then control the number of OpenMP threads and the placement of the threads in the MPI domain. All of these may also be overwritten by the application, so it is not a bullet-proof way to control the behaviour of OpenMP applications. Moreover, some of these environment variables are implementation-specific and hence differ between the Intel and GNU OpenMP runtimes.
The most important variable is OMP_NUM_THREADS. It sets the number of threads to be used in parallel regions. As parallel constructs can be nested, a process may still start more threads than indicated by OMP_NUM_THREADS. However, the total number of threads can be limited by the variable OMP_THREAD_LIMIT.
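For instance, to give each MPI process 5 OpenMP threads and cap the total number of threads it may create at the same value (the numbers are illustrative):

export OMP_NUM_THREADS=5
export OMP_THREAD_LIMIT=5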

Script mympirun (VSC)

The mympirun script is developed by the UGent VSC team to cope with differences between different MPI implementations automatically. It offers support for hybrid programs through the --hybrid command line switch to specify the number of processes per node. The number of threads per process can then be computed by dividing the number of logical cores per node by the number of processes per node.
E.g., to run a hybrid MPI/OpenMP program on 2 nodes, using 20 cores on each node and running 4 MPI ranks per node (hence 5 OpenMP threads per MPI rank), your script would contain

#PBS -l nodes=2:ppn=20

near the top to request the resources from the scheduler. It would then load the module that provides the mympirun command:

module load vsc-mympirun

(besides other modules that are needed to run your application) and finally start your application:

mympirun --hybrid=4 ./hybrid_mpi

assuming your executable is called hybrid_mpi and resides in the working directory. The mympirun launcher will automatically determine the correct number of MPI processes to start based on the resource specifications and the given number of processes per node (the --hybrid switch).
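Putting these pieces together, a complete job script could look like the following sketch; the walltime is illustrative and the modules needed by the application itself are omitted:

#!/bin/bash -l
#PBS -l nodes=2:ppn=20
#PBS -l walltime=01:00:00

cd $PBS_O_WORKDIR
module load vsc-mympirun
# load the modules required by your own application here

# 4 MPI ranks per node, hence 5 OpenMP threads per rank on a 20-core node
mympirun --hybrid=4 ./hybrid_mpi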


Intel toolchain (Intel compilers and Intel MPI)

With Intel MPI, defining the MPI domains is done through the environment variable I_MPI_PIN_DOMAIN. Note however that the Linux scheduler is still free to move all threads of an MPI process to any core within its MPI domain at any time, so there may be a point in further pinning the OpenMP threads through the OpenMP environment variables as well. This is definitely the case if there are more logical cores available in the process partition than there are OpenMP threads.
Some environment variables to influence the thread placement are the Intel-specific variable KMP_AFFINITY and the OpenMP 3.1 standard environment variable OMP_PROC_BIND.
In our case, we want to use all logical cores of a node but make sure that all cores for a domain are as close together as possible. The easiest way to accomplish this is to set OMP_NUM_THREADS to the desired number of OpenMP threads per MPI process and then set I_MPI_PIN_DOMAIN to the value omp:

export I_MPI_PIN_DOMAIN=omp

The longer version is

export I_MPI_PIN_DOMAIN=omp,compact

where compact tells the launcher explicitly to pack threads for a single MPI process as close together as possible. This layout is the default on current versions of Intel MPI so it is not really needed to set this. An alternative, when running 1 MPI process per socket, is to set

export I_MPI_PIN_DOMAIN=socket

To enforce binding of each OpenMP thread to a particular logical core, one can set

export OMP_PROC_BIND=true

As an example, assume again we want to run the program hybrid_mpi on 2 nodes containing 20 cores each, running 4 MPI processes per node, so 5 OpenMP threads per process. The following are then essential components of the job script:
• Specify the resource requirements:
  #PBS -l nodes=2:ppn=20
• Load the modules, including one which contains Intel MPI, e.g.,
  module load intel
• Create a list of unique hosts assigned to the job:
  export HOSTS=$(sort -u $PBS_NODEFILE | paste -s -d,)
  This step is very important; the program will not start with the correct number of MPI ranks if it is not provided with a list of unique host names.
• Set the number of OpenMP threads per MPI process:
  export OMP_NUM_THREADS=5
• Pin the MPI processes:
  export I_MPI_PIN_DOMAIN=omp
• And launch hybrid_mpi using the Intel MPI launcher, specifying 4 MPI processes per host:
  mpirun -hosts $HOSTS -perhost 4 ./hybrid_mpi

In this case we do need to specify both the total number of MPI ranks and the number of MPI ranks per host as we want the same number of MPI ranks on each host. In case you need a more automatic script that is easy to adapt to a different node configuration or different number of processes per node, you can do some of the computations in Bash. The number of processes per node is set in the shell variable MPI_RANKS_PER_NODE. The above commands become:


#!/bin/bash -l
# Adapt nodes and ppn on the next line according to the cluster you're using!
#PBS -l nodes=2:ppn=20
...
MPI_RANKS_PER_NODE=4

module load intel

export HOSTS=`sort -u $PBS_NODEFILE | paste -s -d,`

export OMP_NUM_THREADS=$(($PBS_NUM_PPN / $MPI_RANKS_PER_NODE))

export OMP_PROC_BIND=true

export I_MPI_PIN_DOMAIN=omp

mpirun -hosts $HOSTS -perhost $MPI_RANKS_PER_NODE ./hybrid_mpi

Intel documentation on hybrid programming

Some documents on the Intel web site contain more information on developing and running hybrid programs:
• Interoperability with OpenMP API in the MPI Reference Manual explains the concept of MPI domains and how they should be used/set for hybrid programs.
• Beginning Hybrid MPI/OpenMP Development, useful if you develop your own code.

FOSS toolchain (GCC and Open MPI)

Open MPI has very flexible options for process and thread placement, but they are not always easy to use. There is however also a simple option to indicate the number of logical cores you want to assign to each MPI rank (MPI process): -cpus-per-proc, followed by the number of logical cores assigned to each MPI rank.
If you want to further control the thread placement, you can use the standard OpenMP mechanisms, e.g., the GNU-specific variable GOMP_CPU_AFFINITY or the OpenMP 3.1 standard environment variable OMP_PROC_BIND. As long as we want to use all cores, it won’t matter whether OMP_PROC_BIND is set to true, close or spread. However, setting OMP_PROC_BIND to true is generally a safe choice to ensure that all threads remain on the same core as they were started on, to improve cache performance.
Essential elements of your job script are:

#!/bin/bash -l
# Adapt nodes and ppn on the next line according to the cluster you're using!
#PBS -l nodes=2:ppn=20
...
module load foss

export OMP_NUM_THREADS=5

export MPI_NUM_PROCS=$(( ${PBS_NP} / ${OMP_NUM_THREADS} ))

mpirun --np ${MPI_NUM_PROCS} \
       --map-by socket:PE=${OMP_NUM_THREADS} \
       --bind-to core \
       ./hybrid_mpi


Open MPI allows a lot of control over process placement and rank assignment. The Open MPI mpirun command has several options that influence this process:
• --map-by influences the mapping of processes on the available processing resources
• --rank-by influences the rank assignment
• --bind-to influences the binding of processes to sets of processing resources
• --report-bindings can then be used to report on the process binding (see the sketch below)
More information can be found in the manual pages for mpirun, which can be found on the Open MPI web pages (Open MPI Documentation), and in the following presentations:
• Poster paper “Locality-Aware Parallel Process Mapping for Multi-Core HPC Systems”
• Slides from the presentation “Open MPI Explorations in Process Affinity” from EuroMPI’13
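For instance, to verify the binding that results from the options used in the script above, its mpirun line can be extended with --report-bindings (a sketch reusing the variable and executable names from that example):

mpirun --np ${MPI_NUM_PROCS} \
       --map-by socket:PE=${OMP_NUM_THREADS} \
       --bind-to core \
       --report-bindings \
       ./hybrid_mpi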

5.2 Development tools

5.2.1 Toolchains

What are toolchains?

A toolchain is a collection of tools to build (HPC) software consistently. It consists of
• compilers for C/C++ and Fortran,
• a communications library (MPI), and
• mathematical libraries (linear algebra, FFT).
Toolchains at the VSC are versioned, and refreshed twice a year. All software available on the cluster is rebuilt when a new version of a toolchain is defined to ensure consistency. Version numbers consist of the year of their definition, followed by either a or b, e.g., 2014a. Note that the software components are not necessarily the most recent releases; rather, they are selected for stability and reliability.

Available toolchains at the VSC

Two toolchain flavors are standard across the VSC on all machines that can support them:
• Intel toolchain based on Intel software components,
• FOSS toolchain based on free and open source software.
It may be of interest to note that the Intel C/C++ compilers are more strict with respect to the standards than the GCC C/C++ compilers, while for Fortran, the GCC Fortran compiler tracks the standard more closely and Intel’s Fortran allows for many extensions added during Fortran’s long history. When developing code, one should always build with both compiler suites, and eliminate all warnings.
On average, the Intel compiler suite produces executables that are 5 to 10% faster than those generated using the GCC compiler suite. However, for individual applications the differences may be more significant, with sometimes significantly faster code produced by the Intel compilers while on other applications the GNU compiler may produce much faster code.
Additional toolchains may be defined on specialised hardware to extract the maximum performance from that hardware.
For detailed documentation on each of these toolchains, we refer to the pages linked above in this document.


5.2.2 Intel toolchain

The intel toolchain consists almost entirely of software components developed by Intel. When building third-party software, or developing your own, load the module for the toolchain:

$ module load intel/<version>

where <version> should be replaced by the version to be used, e.g., 2016b. See the documentation on the software module system for more details.
Starting with the 2014b toolchain, the GNU compilers are also included in this toolchain, as the Intel compilers use some of the GNU libraries and as it is possible (though some care is needed) to link code generated with the Intel compilers with code compiled with the GNU compilers.

Compilers: Intel and GNU

Three compilers are available:
• C: icc
• C++: icpc
• Fortran: ifort
Compatible versions of the GNU C (gcc), C++ (g++) and Fortran (gfortran) compilers are also provided.
For example, to compile/link a Fortran program fluid.f90 to an executable fluid with architecture specific optimization, use:

$ ifort -O2 -xHost -o fluid fluid.f90

For documentation on available compiler options, we refer to the links to the Intel documentation at the bottom of this page.

Note: Do not forget to load the toolchain module first!

Optimizing for a CPU architecture

To optimize your application or library for specific CPU architectures, use the appropriate option listed in the table below.

CPU architecture   compiler option
Ivy Bridge         -xAVX
Sandy Bridge       -xAVX
Haswell            -xAVX2
Broadwell          -xAVX2
Naples (AMD)       -xAVX2
Rome (AMD)         -xAVX2
Skylake            -xAVX-512
Cascade Lake       -xAVX-512
detect host CPU    -xHost

For example, the application compiled with the command below will be optimized to run on a Haswell CPU:


$ icc -O3 -xAVX2 -o floating_point floating_point.c

It is possible to build software that contains multiple code paths specific for the architecture that the application is running on. Additional code paths can be specified using the -ax option.

additional code path       Intel compiler option
Ivy Bridge/Sandy Bridge    -axCORE-AVX
Haswell/Broadwell          -axCORE-AVX2
Naples/Rome (AMD)          -axCORE-AVX2
Skylake/Cascade Lake       -axCORE-AVX512

Hence the target architecture can be specified using the -x option, while additional code paths can be specified using -ax. For instance, the following compilation would create an executable with code paths for AVX, AVX2 and AVX-512 instruction sets:

$ icpc -O3 -xAVX -axCORE-AVX2,CORE-AVX512 floating_point.cpp

Software that has been built using these options will run with the appropriate instruction set on Ivy Bridge, Sandy Bridge, Haswell, Broadwell and Skylake CPUs.

Intel OpenMP

The compiler switch to use to compile/link OpenMP C/C++ or Fortran code is -qopenmp in recent versions of the compiler (toolchain intel/2015a and later) or -openmp in older versions. For example, to compile/link a OpenMP C program scatter.c to an executable scatter:

$ icc -qopenmp -O2 -o scatter scatter.c
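Similarly, for a Fortran source file (the file name scatter.f90 is hypothetical):

$ ifort -qopenmp -O2 -o scatter scatter.f90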

Running an OpenMP job

Remember to specify as many processes per node as the number of threads the executable is supposed to run. This can be done using the ppn resource specification, e.g., -l nodes=1:ppn=10 for an executable that should be run with 10 OpenMP threads.

Warning: The number of threads should not exceed the number of cores on a compute node.

Communication library: Intel MPI

For the intel toolchain, impi, i.e., Intel MPI, is used as the communications library. To compile/link MPI programs, wrappers are supplied, so that the correct headers and libraries are used automatically. These wrappers are:
• C: mpiicc
• C++: mpiicpc
• Fortran: mpiifort
Note that the names differ from those of other MPI implementations. The compiler wrappers take the same options as the corresponding compilers.


Using the Intel MPI compilers

For example, to compile/link a C program thermo.c to an executable thermodynamics with architecture specific optimization, use:

$ mpiicc -O2 -xhost -o thermodynamics thermo.c

For further documentation, we refer to the links to the Intel documentation at the bottom of this page. Do not forget to load the toolchain module first.

Running an MPI program with Intel MPI

Note that an MPI program must be run with the exact same version of the toolchain as it was originally built with. The listing below shows a PBS job script thermodynamics.pbs that runs the thermodynamics executable.

#!/bin/bash -l
module load intel/<version>
cd $PBS_O_WORKDIR
mpirun -np $PBS_NP ./thermodynamics

The resource manager passes the number of processes to the job script through the environment variable $PBS_NP, but if you use a recent implementation of Intel MPI, you can even omit -np $PBS_NP as Intel MPI recognizes the Torque resource manager and requests the number of cores itself from the resource manager if the number is not specified.

Intel mathematical libraries

The Intel Math Kernel Library (MKL) is a comprehensive collection of highly optimized libraries that form the core of many scientific HPC codes. Among other functionality, it offers:
• BLAS (Basic Linear Algebra Subprograms), and extensions to sparse matrices
• LAPACK (Linear Algebra PACKage) and ScaLAPACK (the distributed memory version)
• FFT routines, including routines compatible with the FFTW2 and FFTW3 libraries (Fastest Fourier Transform in the West)
• various vector functions and statistical functions that are optimised for the vector instruction sets of all recent Intel processor families
For further documentation, we refer to the links to the Intel documentation at the bottom of this page.
There are two ways to link the MKL library:
• If you use icc, icpc or ifort to link your code, you can use the -mkl compiler option:
  – -mkl=parallel or -mkl: link the multi-threaded version of the library.
  – -mkl=sequential: link the single-threaded version of the library.
  – -mkl=cluster: link the cluster-specific and sequential library, i.e., ScaLAPACK will be included, but this assumes one process per core (so no hybrid MPI/multi-threaded approach).
  The Fortran 95 interface library for LAPACK is not automatically included, though. You’ll have to specify that library separately. You can get the value from the MKL Link Line Advisor, see also the next item.
• Or you can specify all libraries explicitly. To do this, it is strongly recommended to use Intel’s MKL Link Line Advisor, which will also tell you how to link the MKL library with code generated with the GNU and PGI compilers.
Note: On most VSC systems, the variable MKLROOT has a different value from the one assumed in the Intel documentation. Wherever you see $(MKLROOT) you may have to replace it with $(MKLROOT)/mkl.
MKL also offers a very fast streaming pseudorandom number generator, see the documentation for details.
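Returning to the first linking option above, a minimal sketch of linking the multi-threaded MKL via the compiler driver could look like this (the source file name solver.f90 is hypothetical):

$ ifort -O2 -mkl=parallel -o solver solver.f90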

Intel toolchain version numbers

toolchain  icc/icpc/ifort  Intel MPI   Intel MKL   UCX     GCC     binutils
2021a      2021.0.2        2021.0.2    2021.0.2    1.10.0  10.3.0  2.36.1
2020b      2020.4.304      2019.9.304  2020.4.304  1.9.0   10.2.0  2.35
2020a      2020.1.217      2019.7.217  2020.1.217          9.3.0   2.34
2019b      2019.5.281      2018.5.288  2019.5.281          8.3.0   2.32
2019a      2019.1.144      2018.4.274  2019.1.144          8.2.0   2.31.1
2018b      2018.3.222      2018.3.222  2018.3.222          7.3.0   2.30
2018a      2018.1.153      2018.1.153  2018.1.153          6.4.0   2.28
2017b      2017.3.196      2017.3.196  2017.3.196          6.4.0   2.28
2017a      2017.1.196      2017.1.196  2017.1.196          6.3.0   2.27
2016b      2016.3.210      5.1.3.181   11.3.3.210          5.4.0   2.26
2016a      16.0.3          5.1.3.181   11.3.3.210          4.9.3   2.25

Further information on Intel tools

• All Intel documentation of recent software versions is available in the Intel Software Documentation Library. The documentation is typically available for the most recent version and sometimes one older version of the compiler and libraries.
• Some other useful documents:
  – Quick-Reference Guide to Optimization with Intel® Compilers
  – Direct link to the C/C++ compiler developer and reference guide
  – Direct link to the Fortran compiler user and reference guide
  – Page with links to the documentation of the most recent version of Intel MPI
• MKL
  – Link page to the documentation of MKL on the Intel web site
  – MKL Link Line Advisor
• Generic BLAS/LAPACK/ScaLAPACK documentation

5.2.3 FOSS toolchain

The foss toolchain consists entirely of free and open source software components. When building third-party software, or developing your own, load the module for the toolchain:

$ module load foss/<version>

where <version> should be replaced by the version to be used, e.g., 2014a. See the documentation on the software module system for more details.


Compilers: GNU

Three GCC compilers are available:
• C: gcc
• C++: g++
• Fortran: gfortran
For example, to compile/link a Fortran program fluid.f90 to an executable fluid with architecture specific optimization for processors that support AVX2 instructions, use:

$ gfortran -O2 -march=haswell -o fluid fluid.f90

Documentation on other GCC compiler flags and options is available on the project’s website (GCC documentation).

Note: Do not forget to load the toolchain module first!

Optimizing for a CPU architecture

To optimize your application or library for specific CPU architectures, use the appropriate option listed in the table below.

CPU architecture     compiler option
Ivy Bridge           -march=ivybridge
Sandy Bridge         -march=sandybridge
Haswell              -march=haswell
Broadwell            -march=broadwell
Skylake              -march=skylake-avx512
Cascade Lake         -march=cascadelake
Naples/Rome (AMD)    -march=znver2
detect host CPU      -march=native

Note: GCC doesn’t support applications with multiple code paths, so you have to build multiple versions optimized for specific architectures. Dispatching can be done at runtime by checking the value of the $VSC_ARCH_LOCAL environment variable.
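A minimal sketch of such runtime dispatching in a job script could look as follows; the binary names are hypothetical, and we assume $VSC_ARCH_LOCAL holds values such as broadwell or skylake on your cluster:

case "$VSC_ARCH_LOCAL" in
    *skylake*)   ./my_prog_skylake ;;
    *broadwell*) ./my_prog_broadwell ;;
    *)           ./my_prog_generic ;;
esac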

GCC OpenMP

The compiler switch to use to compile/link OpenMP C/C++ or Fortran code is -fopenmp. For example, to compile/link an OpenMP C program scatter.c to an executable scatter, use:

$ gcc -fopenmp -O2 -o scatter scatter.c

Note: The OpenMP runtime library used by GCC is of inferior quality when compared to Intel’s, so developers are strongly encouraged to use the Intel toolchain when developing/building OpenMP software.


Running an OpenMP job

When submitting a job, remember to specify as many processes per node as the number of threads the executable is supposed to run. This can be done using the ppn resource specification, e.g., -l nodes=1:ppn=10 for an executable that should be run with 10 OpenMP threads.

Warning: The number of threads should not exceed the number of cores on a compute node.

Communication library: Open MPI

For the foss toolchain, Open MPI is used as the communications library. To compile/link MPI programs, wrappers are supplied, so that the correct headers and libraries are used automatically. These wrappers are:
• C: mpicc
• C++: mpic++
• Fortran: mpif77, mpif90
The compiler wrappers take the same options as the corresponding compilers.

Using the MPI compilers from Open MPI

For example, to compile/link a C program thermo.c to an executable thermodynamics:

$ mpicc -O2 -o thermodynamics thermo.c

Extensive documentation is provided on the Open MPI documentation site.

Note: Do not forget to load the toolchain module first.

Running an Open MPI program

Note that an MPI program must be run with the exact same version of the toolchain as it was originally built with. The listing below shows a PBS job script thermodynamics.pbs that runs the thermodynamics executable.

#!/bin/bash -l
module load foss/<version>
cd $PBS_O_WORKDIR
mpirun ./thermodynamics

The hosts and the number of processes are retrieved from the queue system, which gets this information from the resource specification for that job.

FOSS mathematical libraries

The FOSS toolchain contains the basic HPC mathematical libraries. It offers:
• OpenBLAS (Basic Linear Algebra Subprograms)
• LAPACK from the Netlib LAPACK repository (Linear Algebra PACKage)
• ScaLAPACK from the Netlib ScaLAPACK repository (Scalable Linear Algebra PACKage)
• FFTW (Fastest Fourier Transform in the West)

Version numbers

            2021a   2020b   2020a   2019a   2018b   2018a   2017b   2017a   2016b   2016a
GCC         10.3.0  10.2.0  9.3.0   8.2.0   7.3.0   6.4.0   6.4.0   6.3     5.4     4.9.3
Open MPI    4.1.1   4.0.5   4.0.3   3.1.3   3.1.1   2.1.2   2.1.1   2.0.2   1.10.3  1.10.2
UCX         1.10.0  1.9.0
OpenBLAS    0.3.15  0.3.12  0.3.9   0.3.5   0.3.1   0.2.20  0.2.20  0.2.19  0.2.18  0.2.15
ScaLAPACK   2.1.0   2.1.0   2.1.0   2.0.2   2.0.2   2.0.2   2.0.2   2.0.2   2.0.2   2.0.2
FFTW        3.3.9   3.3.8   3.3.8   3.3.8   3.3.8   3.3.7   3.3.6   3.3.6   3.3.4   3.3.4
binutils    2.36.1  2.35    2.34    2.32    2.30    2.28    2.28    2.27    2.26    2.25

Further information on FOSS components

• Overview of GCC documentation (all versions)
• Open MPI documentation
  – 4.1.x (foss/2021a)
  – 4.0.x (foss/2020b and foss/2020a)
  – 3.1.x (foss/2018b and foss/2019a)
  – 2.1.x (foss/2017b and foss/2018a)
  – 2.0.x (foss/2017a)
  – 1.10.x (foss/2016b and foss/2016a)
• The OpenBLAS project page and OpenBLAS Wiki
• Generic BLAS/LAPACK/ScaLAPACK documentation
• FFTW documentation
• GNU binutils documentation

5.2.4 Intel Trace Analyzer & Collector

Purpose

Debugging MPI applications is notoriously hard. The Intel Trace Analyzer & Collector (ITAC) can be used to generate a trace while running an application, and visualizing it later for analysis.

Prerequisites

You will need an MPI program (C/C++ or Fortran) to instrument and run.


Step by step

The following steps are the easiest way to use the Intel Trace Analyzer; more sophisticated options are available, however.
1. Load the relevant modules. The exact modules may differ from system to system, but will typically include the itac module and a compatible Intel toolchain, e.g.,

$ module load intel/2019a $ module load itac/2019.2.026

Note: Users of the UAntwerpen clusters should load the inteldevtools module instead, which makes also available Intel’s debugger, VTune, Advisor and Inspector development tools.

2. Compile your application so that it can generate a trace:

$ mpiicc -trace myapp.c -o myapp

where myapp.c is your C/C++ source code. For a Fortran program, this would be:

$ mpiifort -trace myapp.f -o myapp

3. Run your application using a PBS script such as this one:

#!/bin/bash -l
#PBS -N myapp-job
#PBS -l walltime=00:05:00
#PBS -l nodes=4

module load intel/2019a
module load itac/2019.2.026
# Set environment variables for ITAC.
# Unfortunately, the name of the script differs between versions of ITAC
source $EBROOTITAC/bin/itacvars.sh

cd $PBS_O_WORKDIR

mpirun -trace ./myapp

4. When the job is finished, check whether files with names myapp.stf.* have been generated. If so, start the visual analyzer using:

$ traceanalyzer myapp.stf

Further information

Intel’s ITAC documentation

5.2.5 ParameterWeaver

Introduction & motivation

When working on the command line such as in the Bash shell, applications support command line flags and parameters. Many programming languages offer support to conveniently deal with command line arguments out of the box, e.g., Python. However, quite a number of languages used in a scientific context, e.g., C/C++, Fortran, R and Matlab, do not. Although those languages offer the necessary facilities, it is at best somewhat cumbersome to use them, and often the process is rather error prone.
Quite a number of libraries have been developed over the years that can be used to conveniently handle command line arguments. However, this complicates the deployment of the application since it will have to rely on the presence of these libraries.
ParameterWeaver has a different approach: it generates the necessary code to deal with the command line arguments of the application in the target language, so that these source files can be distributed along with those of the application. This implies that systems that don’t have ParameterWeaver installed can still run that application.
Using ParameterWeaver is as simple as writing a definition file for the command line arguments, and executing the code generator via the command line. This can be conveniently integrated into a standard build process such as make.
ParameterWeaver currently supports the following target languages:
• C/C++
• Fortran 90 and later
• R

High-level overview & concepts

Parameter definition files

A parameter definition file is a CSV text file where each line defines a parameter. A parameter has a type, a name, a default value, and, optionally, a description. To add documentation, comments can be added to the definition file. The types are specific to the target language, e.g., an integer would be denoted by int for C/C++, and by integer for Fortran 90. The supported types are documented for each implemented target language.
By way of illustration, a parameter definition file is given below for C as a target language; additional examples are shown in the target language specific sections:

int,numParticles,1000,number of particles in the system
double,temperature,273,system temperature in Kelvin
char*,intMethod,'newton',integration method to use

Note that this parameter definition file should be viewed as an integral part of the source code.

Code generation

ParameterWeaver will generate code to
1. initialize the parameter variables to the default values as specified in the parameter definition file;
2. parse the actual command line arguments at runtime to determine the user specified values; and
3. print the values of the parameters to an output stream.
The implementation and features of the resulting code fragments are specific to the target language, and try to be as close as possible to the idioms of that language. Again, this is documented for each target language specifically. The nature and number of these code fragments varies from one target language to the other, again trying to match the language's idioms as closely as possible. For C/C++, a declaration file (.h) and a definition file (.c) will be generated, while for Fortran 90 a single file (.f90) will be generated that contains both declarations and definitions.


Language specific documentation

C/C++ documentation

Data types

For C/C++, ParameterWeaver supports the following data types:
1. int
2. long
3. float
4. double
5. bool
6. char *

Example C program

Suppose we want to pass command line parameters to the following C program:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char *argv[]) {
    /* parameters, currently hard-coded; we would like to set these
       from the command line instead */
    int n = 10;
    double alpha = 0.19;
    char out[] = "output.txt";
    int verbose = 0;
    FILE *fp;
    int i;
    if (strlen(out) > 0) {
        fp = fopen(out, "w");
    } else {
        fp = stdout;
    }
    if (verbose) {
        fprintf(fp, "# n = %d\n", n);
        fprintf(fp, "# alpha = %.16f\n", alpha);
        fprintf(fp, "# out = '%s'\n", out);
        fprintf(fp, "# verbose = %d\n", verbose);
    }
    for (i = 0; i < n; i++) {
        fprintf(fp, "%d\t%f\n", i, i*alpha);
    }
    if (fp != stdout) {
        fclose(fp);
    }
    return EXIT_SUCCESS;
}

We would like to set the number of iterations n, the factor alpha, the name of the file to write the output to out, and the verbosity verbose at runtime, i.e., without modifying the source code of this program. Moreover, the code to print the values of the variables is error prone: if we later add or remove a parameter, this part of the code has to be updated as well. Defining the command line parameters in a parameter definition file to automatically generate the necessary code simplifies matters considerably.


Example parameter definition file

The following file defines four command line parameters named n, alpha, out and verbose. They are to be interpreted as int, double, char pointer and bool respectively, and if no values are passed via the command line, they will have the default values 10, 0.19, output.txt and false respectively. Note that a string default value is quoted. In this case, the columns in the file are separated by tab characters. The following is the contents of the parameter definition file param_defs.txt:

int       n        10
double    alpha    0.19
char *    out      'output.txt'
bool      verbose  false

This parameter definition file can be created in a text editor such as the one used to write the C program, or from a Microsoft Excel worksheet by saving the latter as a CSV file. As mentioned above, boolean values are also supported; however, the semantics is slightly different from other data types. The default value of a logical variable is always false, regardless of what is specified in the parameter definition file. As opposed to parameters of other types, a logical parameter acts like a flag, i.e., it is a command line option that doesn't take a value. Its absence is interpreted as false, its presence as true. Also note that using a parameter of type bool implies that the program will have to be compiled as C99 rather than C89. All modern compilers fully support C99, so that should not be an issue. However, if your program needs to adhere strictly to the C89 standard, simply use a parameter of type int instead, with 0 interpreted as false and all other values as true. In that case, the option takes a value on the command line.

Generating code

Generating the code fragments is now very easy. If appropriate, load the module (VIC3):

$ module load parameter-weaver

Next, we generate the code based on the parameter definition file:

$ weave -l C -d param_defs.txt

A number of type declarations and functions are generated: the declarations in the header file cl_params.h, the definitions in the source file cl_params.c.
1. Data structure: a type Params is defined as a typedef of a struct with the parameters as fields, e.g.,

typedef struct {
    int n;
    double alpha;
    char *out;
    bool verbose;
} Params;

2. Initialization function: the default values of the command line parameters are assigned to the fields of the Params variable, the address of which is passed to the function.
3. Parsing: the options passed to the program via the command line are assigned to the appropriate fields of the Params variable. Moreover, the argv array will contain only the remaining command line arguments, and the argc variable is set appropriately.
4. Dumper: a function is defined that takes three arguments: a file pointer, a prefix and the address of a Params variable. This function writes the values of the command line parameters to the file pointer, each on a separate line, preceded by the specified prefix.


5. Finalizer: a function that deallocates memory allocated in the initialization or the parsing functions to avoid memory leaks.

Using the code fragments

The declarations are simply included using preprocessor directives:

#include "cl_params.h"

A variable to hold the parameters has to be defined and its values initialized:

Params params;
initCL(&params);

Next, the command line parameters are parsed and their values assigned:

parseCL(&params, &argc, &argv);

The dumper can be called whenever the user likes, e.g.,

dumpCL(stdout, "", &params);

The code for the program is thus modified as follows:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "cl_params.h"

int main(int argc, char *argv[]) {
    FILE *fp;
    int i;
    Params params;
    initCL(&params);
    parseCL(&params, &argc, &argv);
    if (strlen(params.out) > 0) {
        fp = fopen(params.out, "w");
    } else {
        fp = stdout;
    }
    if (params.verbose) {
        dumpCL(fp, "# ", &params);
    }
    for (i = 0; i < params.n; i++) {
        fprintf(fp, "%d\t%f\n", i, i*params.alpha);
    }
    if (fp != stdout) {
        fclose(fp);
    }
    finalizeCL(&params);
    return EXIT_SUCCESS;
}

Note that in this example, additional command line parameters are simply ignored. As mentioned before, they are available in the array argv; argv[0] will hold the program's name, and subsequent elements up to argc - 1 contain the remaining command line parameters.
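By way of illustration, an invocation of the resulting program could look as follows. This is only a sketch: the exact option syntax is defined by the generated cl_params code (check the generated parser), and the trailing extra arguments are hypothetical.

$ ./myapp -n 100 -alpha 0.5 -out results.txt -verbose extra_arg_1 extra_arg_2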


Fortran 90 documentation

Data types

For Fortran 90, ParameterWeaver supports the following data types:
1. integer
2. real
3. double precision
4. logical
5. character(len=1024)

Example Fortran 90 program

Suppose we want to pass command line parameters to the following Fortran program:

program main
    use iso_fortran_env
    implicit none
    ! parameters, currently hard-coded; we would like to set these
    ! from the command line instead
    integer :: n = 10
    double precision :: alpha = 0.19d0
    character(len=1024) :: out = 'output.txt'
    logical :: verbose = .false.
    integer :: unit_nr = 8, i, istat
    if (len(trim(out)) > 0) then
        open(unit=unit_nr, file=trim(out), action="write")
    else
        unit_nr = output_unit
    end if
    if (verbose) then
        write (unit_nr, "(A, I20)") "# n = ", n
        write (unit_nr, "(A, F24.15)") "# alpha = ", alpha
        write (unit_nr, "(A, '''', A, '''')") "# out = ", trim(out)
        write (unit_nr, "(A, L1)") "# verbose = ", verbose
    end if
    do i = 1, n
        write (unit_nr, "(I3, F5.2)") i, i*alpha
    end do
    if (unit_nr /= output_unit) then
        close(unit=unit_nr)
    end if
    stop
end program main

We would like to set the number of iterations n, the factor alpha, the name of the file to write the output to out, and the verbosity verbose at runtime, i.e., without modifying the source code of this program. Moreover, the code to print the values of the variables is error prone: if we later add or remove a parameter, this part of the code has to be updated as well. Defining the command line parameters in a parameter definition file to automatically generate the necessary code simplifies matters considerably.

Example parameter definition file

The following file defines four command line parameters named n, alpha, out and verbose. They are to be interpreted as integer, double precision, character(len=1024) and logical respectively, and if no values are passed via the command line, they will have the default values 10, 0.19, output.txt and false respectively. Note that a string default value is quoted. In this case, the columns in the file are separated by tab characters. The following is the contents of the parameter definition file param_defs.txt:

integer              n        10
double precision     alpha    0.19
character(len=1024)  out      'output.txt'
logical              verbose  false

This parameter definition file can be created in a text editor such as the one used to write the Fortran program, or from a Microsoft Excel worksheet by saving the latter as a CSV file. As mentioned above, logical values are also supported; however, the semantics is slightly different from other data types. The default value of a logical variable is always false, regardless of what is specified in the parameter definition file. As opposed to parameters of other types, a logical parameter acts like a flag, i.e., it is a command line option that doesn't take a value. Its absence is interpreted as false, its presence as true.

Generating code

Generating the code fragments is now very easy. If appropriate, load the module (VIC3):

$ module load parameter-weaver

Next, we generate the code based on the parameter definition file:

$ weave -l Fortran -d param_defs.txt

A number of type declarations and functions are generated in the module file cl_params.f90.
1. Data structure: a type params_type is defined as a structure with the parameters as fields, e.g.,

type :: params_type
    integer :: n
    double precision :: alpha
    character(len=1024) :: out
    logical :: verbose
end type params_type

2. Initialization function: the default values of the command line parameters are assigned to the fields of the params_type variable.
3. Parsing: the options passed to the program via the command line are assigned to the appropriate fields of the params_type variable. Moreover, the next variable of type integer will hold the index of the next command line parameter, i.e., the first of the remaining command line parameters that was not handled by the parsing function.
4. Dumper: a function is defined that takes three arguments: a unit number for output, a prefix and the params_type variable. This function writes the values of the command line parameters to the output stream associated with the unit number, each on a separate line, preceded by the specified prefix.

Using the code fragments

The module file is included by the use directive:

use cl_params


A variable to hold the parameters has to be defined and its values initialized:

type(params_type) :: params
call init_cl(params)

Next, the command line parameters are parsed and their values assigned:

integer :: next
call parse_cl(params, next)

The dumper can be called whenever the user likes, e.g.,

call dump_cl(output_unit, "", params)

The code for the program is thus modified as follows:

program main
    use cl_params
    use iso_fortran_env
    implicit none
    type(params_type) :: params
    integer :: unit_nr = 8, i, istat, next
    call init_cl(params)
    call parse_cl(params, next)
    if (len(trim(params%out)) > 0) then
        open(unit=unit_nr, file=trim(params%out), action="write")
    else
        unit_nr = output_unit
    end if
    if (params%verbose) then
        call dump_cl(unit_nr, "# ", params)
    end if
    do i = 1, params%n
        write (unit_nr, "(I3, F5.2)") i, i*params%alpha
    end do
    if (unit_nr /= output_unit) then
        close(unit=unit_nr)
    end if
    stop
end program main

Note that in this example, additional command line parameters are simply ignored. As mentioned before, they are available using the standard get_command_argument function, starting from the value of the variable next set by the call to parse_cl.

R documentation

Data types

For R, ParameterWeaver supports the following data types:
1. integer
2. double
3. logical
4. string


Example R script

Suppose we want to pass command line parameters to the following R script:

# parameters, currently hard-coded; we would like to set these
# from the command line instead
n <- 10
alpha <- 0.19
out <- 'output.txt'
verbose <- FALSE
if (nchar(out) > 0) {
    conn <- file(out, 'w')
} else {
    conn <- stdout()
}
if (verbose) {
    write(sprintf("# n = %d\n", n), conn)
    write(sprintf("# alpha = %.16f\n", alpha), conn)
    write(sprintf("# out = '%s'\n", out), conn)
    write(sprintf("# verbose = %s\n", verbose), conn)
}
for (i in 1:n) {
    write(sprintf("%d\t%f\n", i, i*alpha), conn)
}
if (conn != stdout()) {
    close(conn)
}

We would like to set the number of iterations n, the factor alpha, the name of the file to write the output to out, and the verbosity verbose at runtime, i.e., without modifying the source code of this script. Moreover, the code to print the values of the variables is error prone: if we later add or remove a parameter, this part of the code has to be updated as well. Defining the command line parameters in a parameter definition file to automatically generate the necessary code simplifies matters considerably.

Example parameter definition file

The following file defines four command line parameters named n, alpha, out and verbose. They are to be interpreted as integer, double, string and logical respectively, and if no values are passed via the command line, they will have the default values 10, 0.19, output.txt and false respectively. Note that a string default value is quoted, just as it would be in R code. In this case, the columns in the file are separated by tab characters. The following is the contents of the parameter definition file param_defs.txt:

integer  n        10
double   alpha    0.19
string   out      'output.txt'
logical  verbose  F

This parameter definition file can be created in a text editor such as the one used to write R scripts, or from a Microsoft Excel worksheet by saving the latter as a CSV file. As mentioned above, logical values are also supported; however, the semantics is slightly different from other data types. The default value of a logical variable is always false, regardless of what is specified in the parameter definition file. As opposed to parameters of other types, a logical parameter acts like a flag, i.e., it is a command line option that doesn't take a value. Its absence is interpreted as false, its presence as true.

Generating code

Generating the code fragments is now very easy. If appropriate, load the module (VIC3):


$ module load parameter-weaver

Next, we generate the code based on the parameter definition file:

$ weave -l R -d param_defs.txt

Three code fragments are generated, all grouped in a single R file cl_params.r.
1. Initialization: the default values of the command line parameters are assigned to global variables with the names as specified in the parameter definition file.
2. Parsing: the options passed to the program via the command line are assigned to the appropriate variables. Moreover, an array containing the remaining command line arguments is created as cl_params.
3. Dumper: a function is defined that takes two arguments: a file connector and a prefix. This function writes the values of the command line parameters to the file connector, each on a separate line, preceded by the specified prefix.

Using the code fragments

The code fragments can be included into the R script by sourcing it:

source("cl_params.r")

The parameter initialization and parsing are executed at this point; the dumper can be called whenever the user likes, e.g.,

dump_cl(stdout(), "")

The code for the script is thus modified as follows:

source('cl_params.r')
if (nchar(out) > 0) {
    conn <- file(out, 'w')
} else {
    conn <- stdout()
}
if (verbose) {
    dump_cl(conn, "# ")
}
for (i in 1:n) {
    cat(paste(i, "\t", i*alpha), file = conn, sep = "\n")
}
if (conn != stdout()) {
    close(conn)
}

Note that in this example, additional command line parameters are simply ignored. As mentioned before, they are available in the vector cl_params if needed.

Octave documentation

Data types

For Octave, ParameterWeaver supports the following data types:


1. double
2. logical
3. string

Example Octave script

Suppose we want to pass command line parameters to the following Octave script:

% parameters, currently hard-coded; we would like to set these
% from the command line instead
n = 10;
alpha = 0.19;
out = 'output.txt';
verbose = false;
if (size(out) > 0)
    fid = fopen(out, "w");
else
    fid = stdout;
end
if (verbose)
    fprintf(fid, "# n = %d\n", n);
    fprintf(fid, "# alpha = %.16f\n", alpha);
    fprintf(fid, "# out = '%s'\n", out);
    fprintf(fid, "# verbose = %1d\n", verbose);
end
for i = 1:n
    fprintf(fid, "%d\t%f\n", i, i*alpha);
end
if (fid != stdout)
    fclose(fid);
end

We would like to set the number of iterations n, the factor alpha, the name of the file to write the output to out, and the verbosity verbose at runtime, i.e., without modifying the source code of this script. Moreover, the code to print the values of the variables is error prone: if we later add or remove a parameter, this part of the code has to be updated as well. Defining the command line parameters in a parameter definition file to automatically generate the necessary code simplifies matters considerably.

Example parameter definition file

The following file defines four command line parameters named n, alpha, out and verbose. They are to be interpreted as double, double, string and logical respectively, and if no values are passed via the command line, they will have the default values 10, 0.19, output.txt and false respectively. Note that a string default value is quoted, just as it would be in Octave code. In this case, the columns in the file are separated by tab characters. The following is the contents of the parameter definition file param_defs.txt:

double   n        10
double   alpha    0.19
string   out      'output.txt'
logical  verbose  F

This parameter definition file can be created in a text editor such as the one used to write Octave scripts, or from a Microsoft Excel worksheet by saving the latter as a CSV file. As mentioned above, logical values are also supported; however, the semantics is slightly different from other data types. The default value of a logical variable is always false, regardless of what is specified in the parameter definition file. As opposed to parameters of other types, a logical parameter acts like a flag, i.e., it is a command line option that doesn't take a value. Its absence is interpreted as false, its presence as true.


Generating code

Generating the code fragments is now very easy. If appropriate, load the module (VIC3):

$ module load parameter-weaver

Next, we generate the code based on the parameter definition file:

$ weave -l octave -d param_defs.txt

Three code fragments are generated, each in its own file, i.e., init_cl.m, parse_cl.m, and dump_cl.m.
1. Initialization: init_cl returns a struct with the parameters as fields, set to the default values as specified in the parameter definition file.
2. Parsing: the options passed to the program via the command line are assigned to the appropriate fields of that struct by parse_cl. Moreover, an array containing the remaining command line arguments is returned as the second value from parse_cl.
3. Dumper: a function is defined that takes three arguments: a file identifier, a prefix and the parameter struct. This function writes the values of the command line parameters to the file identifier, each on a separate line, preceded by the specified prefix.

Using the code fragments

The generated functions can be used by simply calling them from the main script. The code for the script is thus modified as follows:

params = init_cl();
params = parse_cl(params);
if (size(params.out) > 0)
    fid = fopen(params.out, "w");
else
    fid = stdout;
end
if (params.verbose)
    dump_cl(stdout, "# ", params);
end
for i = 1:params.n
    fprintf(fid, "%d\t%f\n", i, i*params.alpha);
end
if (fid != stdout)
    fclose(fid);
end

Note that in this example, additional command line parameters are simply ignored. As mentioned before, they can be obtained as the second return value from the call to parse_cl.

Future work

The following features are planned in future releases:
• Additional target languages:
  – Matlab
  – Java


  Support for Perl and Python is not planned, since these languages have facilities to deal with command line arguments in their respective standard libraries.
• Configuration files are an alternative way to specify parameters for an application, so ParameterWeaver will also support this in a future release.

Contact & support

Bug reports and feature requests can be sent to Geert Jan Bex.

5.2.6 Version control systems

Why use a version control system?

A version control system (VCS) helps you to manage the changes to the source files of your project, and most systems also support team development. Since it remembers the history of your files, you can always return to an earlier version if you’ve screwed up making changes. By adding comments when you store a new version in the VCS it also becomes much easier to track which change was made for what purpose at what time. And if you develop in a team, it helps to organise making coordinated changes to the code base, and it supports co-development even across file system borders (e.g., when working with a remote partner). Most Integrated Development Environments (IDE) offer support for one or more version control systems. E.g., Eclipse, the IDE which we recommend for the development of C/C++ or Fortran codes on clusters, supports all of the systems mentioned on this page, some out-of-the-box and others by adding an additional package. The systems mentioned on this page are all available on Linux, macOS and Windows through the Windows Subsystem for Linux (WSL).

Types of version control systems

An excellent introduction to the various types of version control systems can be found in the book Pro GIT by Scott Chacon and Ben Straub.

Centralised systems

Centralised version control systems were developed to enable people to collaborate on code and documents with people on different systems that may not share a common file system. The version files are maintained by a server to which multiple clients can connect and check out files, and the systems help to manage concurrent changes to a file by several users (through a copy-modify-merge procedure). Popular examples of this type are CVS (Concurrent Versions System) and SVN (Subversion). Of those two, SVN is the more recent system, while CVS is no longer actively developed and less and less used. A minimal checkout/commit cycle with the command line client is sketched after the links below.
Links:
• SVN Wikipedia page
• SVN implementations
  – Command-line clients are included in most Linux distributions, macOS and Windows (WSL). The command line client is also available on the VSC clusters.
  – TortoiseSVN (or go straight to the TortoiseSVN web site) is a popular Windows native GUI client that integrates well with the explorer. However, if you google on "SVN GUI" you'll find a plethora of other choices, not only for Windows but also for macOS and Linux.


  – SVN can be integrated with the Eclipse IDE through the "Subversive SVN team provider" plugin, which can be installed through the "Install New Software" panel in the Help menu. More information and instructions are available on the Subversive subsite of the main Eclipse web site.
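By way of illustration, a typical checkout/commit cycle with the SVN command line client is sketched below; the repository URL and file name are hypothetical, so replace them by those of your own project.

$ svn checkout https://svn.example.org/repos/myproject/trunk myproject
$ cd myproject
$ svn add newfile.c
$ svn commit -m "Add newfile.c with the new solver routine"
$ svn update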

Distributed systems

The weak point of the centralised systems is that they require you to be online to check out a file or to commit a revision. In a distributed system, the clients mirror the complete repository and not just the latest version of each file. When online, the user can then synchronise the local repository with the copy on a server. In a single-user scenario you can still keep all your files in the local repository without using a server, and hence it doesn't make sense anymore to still use one of the old local-only version control systems. The disadvantage of a distributed system is that you are not forced to synchronise after every commit, so that the local repositories of various users on a project can be very much out-of-sync with each other, making the job harder when those versions have to be merged again. Popular examples of systems of this type are Git (originally developed to manage the Linux kernel project) and Mercurial (sometimes abbreviated as Hg, chemists will understand why). A minimal clone/commit/push cycle is sketched after the links below.
Links:
• Git on Wikipedia
• Main Git web page
• Git implementations
  – If you have a Linux system, Git is most likely already installed on your system. On macOS, git is available through Xcode, though it is not always the most recent version. On Windows, you can use WSL. Downloads for all systems are also available in the download section of the main Git web site. That page also contains links to a number of GUI options. Most if not all GUI tools store projects in a way that is fully compatible with the command line tools, so you can use both simultaneously. The command line client is also available on the VSC clusters.
  – Another nice GUI application is SourceTree, produced by Atlassian. Atlassian is the company behind the Bitbucket cloud service, but their tool also works well with GitHub, one of their main competitors. It has a very nice way of representing the history of a local repository.
  – The Eclipse IDE comes with built-in support for Git through the standard plug-in EGit. More recent versions of this plugin may be available through the Eclipse Marketplace.
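By way of illustration, a basic Git cycle with the command line client is sketched below; the repository URL, file name and branch name are hypothetical, so adapt them to your own project.

$ git clone https://github.com/example/myproject.git
$ cd myproject
$ git add newfile.c
$ git commit -m "Add newfile.c with the new solver routine"
$ git pull
$ git push origin main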

Cloud services

Many companies offer hosting services for SVN, Git or Mercurial repositories in the cloud. Several offer free public hosting for Open Source projects or have free access for academic accounts. Some noteworthy ones that are popular for academic projects are:
• GitHub (github.com) offers free Git and Subversion hosting for Open Source projects. We use this service for some VSC in-house tools development. It is also possible to host private projects if you subscribe to one of their paying plans or register as an academic user.
• GitLab (gitlab.com) also offers free public and private repositories.
• SourceForge is a very well known service for hosting Open Source projects. It currently supports projects managed through Subversion, Git, Mercurial and a few other systems.
However, we urge you to always carefully check the terms-of-use of these services to ensure that, e.g., the way they deal with intellectual property is in line with your institute's requirements. Also note that some institutes provide version control services.


Which one should I use?

It is not up to us to make this choice for you, but here are a number of elements that you should take into account:
• Use a cloud service if you can, since this ensures that your code is safely stored off-site. However, verify this is in line with the intellectual property policies of your institute.
• Subversion, Git and Mercurial are all recent systems that are well maintained and supported by several hosting services. Git currently seems the most popular choice and the VSC provides training sessions on Git.
• Subversion and Git are installed on most VSC systems. We use Git ourselves for some of our in-house development.
• Centralised version management systems have a simpler concept than the distributed ones, but if you expect prolonged periods that you are offline, you have to keep in mind that you cannot make any commits during that period.
• As you have only a single copy of the repository in a centralised system, a reliable hosting service or a good backup strategy is important. In a distributed system it would still be possible to reconstruct the contents of a repository from the other repositories.
• If you want to use an IDE, it is good to check which systems are supported by the IDE. E.g., Eclipse supports Git out-of-the-box, and Subversion and Mercurial through a plug-in. Visual Studio also supports all three of these systems.

5.3 Libraries

5.3.1 BLAS and LAPACK

Scope

On modern CPUs the actual performance of a program depends very much on making optimal use of the caches. Many standard mathematical algorithms have been coded in standard libraries, and several vendors and research groups build optimised versions of those libraries for certain computers. They are key to extracting optimal performance from modern processors. Don't think you can write a better dense matrix-matrix multiplication routine or dense matrix solver than the specialists (unless you're a real specialist yourself)! Many codes use dense linear algebra routines. Hence it is no surprise that in this field, collaboration led to the definition of a lot of standard functions and many groups worked hard to build optimal implementations:
• BLAS (Basic Linear Algebra Subprograms) is a library of vector, vector-vector, matrix-vector and matrix-matrix operations.
• LAPACK, a library of dense and banded matrix linear algebra routines such as solving linear systems and the eigenvalue and singular value decomposition. LAPACK95 defines Fortran 95 interfaces for all routines.
• ScaLAPACK is a distributed memory parallel library offering some functionality similar to LAPACK.
Reference Fortran implementations exist, so you can always recompile code using these libraries on systems on which the libraries are not available.

BLAS and LAPACK at the VSC

We provide BLAS and LAPACK routines through the toolchains. Hence the instructions for linking with the libraries are given on the toolchains page.


• The Intel toolchain provides the BLAS, LAPACK and ScaLAPACK interfaces through the Intel Math Kernel Library (MKL).
• The FOSS toolchain provides open source implementations:
  – the OpenBLAS BLAS library
  – the standard LAPACK implementation
  – the standard ScaLAPACK implementation
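As a minimal sketch, compiling and linking a C program against these libraries could look as follows. The authoritative link instructions are on the toolchains page; the module versions and the source file myprog.c are only examples, and the exact flags may differ between toolchain versions.

$ module load intel/2019a
$ icc -O2 -mkl myprog.c -o myprog        # MKL provides BLAS/LAPACK

$ module load foss/2019a
$ gcc -O2 myprog.c -o myprog -lopenblas  # OpenBLAS provides BLAS/LAPACK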

Links

• The LAPACK, LAPACK95 and ScaLAPACK manuals are published by SIAM, but there are online HTML versions available on Netlib (the repository that also contains the reference Fortran implementations):
  – LAPACK user guide in the Netlib BLAS repository
  – LAPACK95 user guide in the Netlib LAPACK repository
  – ScaLAPACK user guide in the Netlib ScaLAPACK repository
• Documentation about specific implementations is available on the toolchain pages:
  – Intel toolchain
  – FOSS toolchain

5.3.2 Perl package management

Introduction

(Note: the Perl community uses the term 'modules' rather than 'packages'; however, in this documentation we use the term 'packages' to try and avoid confusion with the module system for loading software.) Perl comes with an extensive standard library, and you are strongly encouraged to use those packages as much as possible, since this will ensure that your code can be run on any platform that supports Perl. However, many useful extensions to and libraries for Perl come in the form of packages that can be installed separately. Some of those are part of the default installation on VSC infrastructure. Given the astounding number of packages, it is not sustainable to install each and every one of them system-wide. Since it is very easy for a user to install them just for themselves, or for their research group, that is not a problem though. Do not hesitate to contact support whenever you encounter trouble doing so.

Checking for installed Perl packages

To check which Perl packages are installed, the cpan utility is useful. It will list all packages that are installed for the Perl distribution you are using, including those installed by you, i.e., those in your PERL5LIB environment variable.
1. Load the module for the Perl version you wish to use, e.g.,

$ module load Perl/5.28.2-foss-2018a

2. Run cpan:

$ cpan -l


Installing your own Perl packages

Setting up your own package repository for Perl is straightforward. For this purpose, the cpan utility first needs to be configured. Replace the paths /user/leuven/301/vsc30140 and /data/leuven/301/vsc30140 by the ones to your own home and data directories.
1. Load the appropriate Perl module, e.g.,

$ module load Perl/5.28.2-foss-2018a

2. Create a directory to install in, i.e.,

$ mkdir $VSC_DATA/perl5

3. Run cpan:

$ cpan

4. Configure internet access and mirror sites:

cpan[1]> o conf init connect_to_internet_ok urllist

5. Set the install base, i.e., the directory created above:

cpan[2]> o conf makepl_arg INSTALL_BASE=/data/leuven/301/vsc30140/perl5

Note that you cannot use an environment variable for the path.
6. Set the preference directory path:

cpan[3]> o conf prefs_dir /user/leuven/301/vsc30140/.cpan/prefs

7. Commit changes so that they are stored in ~/.cpan/CPAN/MyConfig.pm, i.e.,

cpan[4]> o conf commit

8. Quit cpan:

cpan[5]> q

Now Perl packages can be installed easily, e.g.,

$ cpan IO::Scalar

Note that this will install all dependencies as needed, though you may be prompted. To effortlessly use locally installed packages, install the local::lib package first, and use the following code fragment in Perl scripts that depend on locally installed packages.

use local::lib;
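To make the packages installed under the INSTALL_BASE directory above available in every shell session, you can also extend the PERL5LIB environment variable, e.g., in your .bashrc. This is a sketch assuming the lib/perl5 layout created by the installation procedure above:

export PERL5LIB="${VSC_DATA}/perl5/lib/perl5:${PERL5LIB}"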

5.3.3 Python package management

Introduction

Python comes with an extensive standard library, and you are strongly encouraged to use those packages as much as possible, since this will ensure that your code can be run on any platform that supports Python.


However, many useful extensions to and libraries for Python come in the form of packages that can be installed separately. Some of those are part of the default installation on VSC infrastructure; others have been made available through the module system and must be loaded explicitly. Given the astounding number of packages, it is not sustainable to install each and every one of them system-wide. Since it is very easy for a user to install them just for themselves, or for their research group, that is not a problem though. Do not hesitate to contact support whenever you encounter trouble doing so.

Checking for installed packages

To check which Python packages are installed, the pip utility is useful. It will list all packages that are installed for the Python distribution you are using, including those installed by you, i.e., those in your PYTHONPATH environment variable.
1. Load the module for the Python version you wish to use, e.g.,

$ module load Python/3.7.0-foss-2018b

2. Run pip:

$ pip freeze

Note that some packages, e.g., mpi4py, h5py, pytables, ..., are available through the module system and have to be loaded separately. These packages will not be listed by pip unless you have loaded the corresponding module. In recent toolchains, many of the packages you need for scientific computing have been bundled into the SciPy-bundle module.
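For example, to check which versions of the SciPy-bundle module are available on the cluster you are using, and to load one of them (the version shown is merely an example and will differ between systems):

$ module av SciPy-bundle
$ module load SciPy-bundle/2020.11-foss-2020b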

Install Python packages using conda

Note: Conda packages are incompatible with the software modules. Usage of conda is discouraged in the clusters at UAntwerpen, UGent, and VUB.

The easiest way to install and manage your own Python environment is conda. Using conda has some major advantages:
• You can create project-specific environments that can be shared with others and (up to a point) across platforms. This makes it easier to ensure that your experiments are reproducible.
• conda takes care of the dependencies, up to the level of system libraries. This makes it very easy to install packages.
However, this last advantage is also a potential drawback: you have to review the libraries that conda installs because they may not have been optimized for the hardware you are using. For linear algebra, conda will typically use Intel MKL runtime libraries, giving you performance that is on par with the Python modules for numpy and scipy. However, care has to be taken in a number of situations. When you require mpi4py, conda will typically use a library that is not configured and optimized for the networks used in our clusters, and the performance impact is quite severe. Another example is TensorFlow when running on CPUs: the default package is not optimized for the CPUs in our infrastructure, and will run sub-optimally. (Note that this is not the case when you run TensorFlow on GPUs, since conda will install the appropriate CUDA libraries.) These issues can be avoided by using Intel's Python distribution, which contains Intel MPI and optimized versions of packages such as scikit-learn and TensorFlow. You will find installation instructions provided by Intel.
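As a sketch, one way to install the Intel Python distribution is to create a conda environment from Intel's conda channel; the channel and package names below are those used by Intel at the time of writing and may change, so check Intel's own instructions:

$ conda create -n intelpython -c intel intelpython3_full python=3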


Install Miniconda

If you have Miniconda already installed, you can skip ahead to the next section; if Miniconda is not installed, we start with that. Download the Bash script that will install it from conda.io using, e.g., wget:

$ wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh

Once downloaded, run the installation script:

$ bash Miniconda3-latest-Linux-x86_64.sh -b -p $VSC_DATA/miniconda3

Warning: It is important to use $VSC_DATA to store your conda installation since environments tend to be large, and your quota in $VSC_HOME would be exceeded soon.

Optionally, you can add the path to the Miniconda installation to the PATH environment variable in your .bashrc file. This is convenient, but may lead to conflicts when working with the module system, so make sure that you know what you are doing in either case. The line to add to your .bashrc file would be:

export PATH="${VSC_DATA}/miniconda3/bin:${PATH}"

Create an environment

First, ensure that the Miniconda installation is in your PATH environment variable. The following command should return the full path to the conda command:

$ which conda

If the result is blank, or reports that conda cannot be found, modify the PATH environment variable appropriately by adding Miniconda's bin directory to PATH. You can create an environment based on the default conda channels, but it is recommended to at least consider the Intel Python distribution. Intel provides instructions on how to install the Intel Python distribution. Alternatively, create a new conda environment based on the default channels:

$ conda create -n science numpy scipy matplotlib

This command creates a new conda environment called science, and installs a number of Python packages that you will probably want to have handy in any case to preprocess, visualize, or postprocess your data. You can of course install more, depending on your requirements and personal taste. This will default to the latest Python 3 version; if you need a specific version, e.g., Python 2.7.x, this can be specified as follows:

$ conda create -n science python=2.7 numpy scipy matplotlib

Work with the environment

To work with an environment, you have to activate it. This is done with, e.g.,


$ source activate science

Here, science is the name of the environment you want to work in.

Install an additional package

To install an additional package, e.g., tensorflow-gpu, first ensure that the environment you want to work in is activated.

$ source activate science

Next, install the package:

$ conda install tensorflow-gpu

Note that conda will take care of all dependencies, including non-Python libraries (e.g., cuDNN and CUDA for the example above). This ensures that you work in a consistent environment.

Update/remove a package

Using conda, it is easy to keep your packages up-to-date. Updating a single package (and its dependencies) can be done using:

$ conda update pandas

Updating all packages in the environment is trivial:

$ conda update --all

Removing an installed package:

$ conda remove tensorflow-gpu

Deactivate an environment

To deactivate a conda environment, i.e., return the shell to its original state, use the following command:

$ source deactivate

More information

Additional information about conda can be found on its documentation site.

Alternatives to conda

Setting up your own package repository for Python is straightforward. PyPI, the Python Package Index, is a web repository of Python packages and you can easily install packages from it using either easy_install or pip. In both cases, you'll have to create a subdirectory for Python in your ${VSC_DATA} directory, add this directory to your PYTHONPATH after loading a suitable Python module, and then point easy_install or pip to that directory as the install target rather than the default (which of course is write-protected on a multi-user system). Both commands will take care of dependencies as well.


Installing packages using easy_install

If you prefer to use easy_install, you can follow these instructions:
1. Load the appropriate Python module, i.e., the one you want the Python package to be available for:

$ module load Python/3.7.0-foss-2018b

2. Create a directory to hold the packages you install, the last three directory names are mandatory:

$ mkdir -p "${VSC_DATA}/python_lib/lib/python3.7/site-packages/"

3. Add that directory to the PYTHONPATH environment variable for the current shell to do the installation:

$ export PYTHONPATH="${VSC_DATA}/python_lib/lib/python3.7/site-packages/:${PYTHONPATH}"

4. Add the following to your .bashrc so that Python knows where to look next time you use it:

export PYTHONPATH="${VSC_DATA}/python_lib/lib/python3.7/site-packages/:${PYTHONPATH}"

5. Install the package, using the --prefix option to specify the install path (this would install the sphinx package):

$ easy_install --prefix="${VSC_DATA}/python_lib" sphinx

Installing packages using pip

If you prefer using pip, you can perform an install in your own directories as well by providing an install option.
1. Load the appropriate Python module, i.e., the one you want the Python package to be available for:

$ module load Python/3.7.0-foss-2018b

2. Create a directory to hold the packages you install, the last three directory names are mandatory:

$ mkdir -p "${VSC_DATA}/python_lib/lib/python3.7/site-packages/"

3. Add that directory to the PYTHONPATH environment variable for the current shell to do the installation:

$ export PYTHONPATH="${VSC_DATA}/python_lib/lib/python3.7/site-packages/:${PYTHONPATH}"

4. Add the following to your .bashrc so that Python knows where to look next time you use it:

export PYTHONPATH="${VSC_DATA}/python_lib/lib/python3.7/site-packages/:${PYTHONPATH}"

5. Install the package, using the --prefix install option to specify the install path (this would install the sphinx package):

$ pip install --install-option="--prefix=${VSC_DATA}/python_lib" sphinx

For newer versions of pip, you would use:


$ pip install --prefix="${VSC_DATA}/python_lib" sphinx

Installing Anaconda on NX node (KU Leuven Genius)

1. Before installing, make sure that you do not have a .local/lib directory in your $VSC_HOME. In case it exists, please move it to some other location or a temporary archive, as it creates conflicts with Anaconda.
2. Download the appropriate version of Anaconda (64-Bit (x86) Linux Installer) from https://www.anaconda.com/products/individual#Downloads
3. Change the permissions of the file (if necessary):

$ chmod u+x Anaconda3-2019.07-Linux-x86_64.sh

4. Execute the installer:

$ ./Anaconda3-2019.07-Linux-x86_64.sh

You will be asked to accept the license agreement and to choose the location where it should be installed (please choose your $VSC_DATA). After the installation is done, you can let the installer add the Anaconda path to your .bashrc. We recommend not to do that, as it will prevent creating NX desktops. Instead, you can manually (or in another script) modify your PATH when you want to use Anaconda:

export PATH="${VSC_DATA}/anaconda3/bin:$PATH"

5. Go to the directory where Anaconda is installed and check for updates, e.g.,:

$ cd anaconda3/bin/
$ conda update anaconda-navigator

6. You can start the navigator from that directory with:

$ ./anaconda-navigator

5.3.4 R package management

Introduction

Much of the useful functionality of R comes in the form of packages that can be installed separately. Some of those are part of the default installation on VSC infrastructure. Given the astounding number of packages, it is not sustainable to install each and every one of them system-wide. Since it is very easy for a user to install them just for themselves, or for their research group, that is not a problem though. Do not hesitate to contact support whenever you encounter trouble doing so.

Standard R package installation

Setting up your own package repository for R is straightforward.
1. Load the appropriate R module, i.e., the one you want the R package to be available for:

$ module load R/3.2.1-foss-2014a-x11-tcl

2. Start R and install the package:


> install.packages("DEoptim")

Some R packages depend on libraries installed on the system. In that case, you first have to load the modules for these libraries, and only then proceed to the R package installation. For instance, if you would like to install the gsl R package, you would first have to load the module for the GSL library, e.g.,

$ module load GSL/2.5-GCC-6.4.0-2.28

Note that R packages often depend on the specific R version they were installed for, so you may need to reinstall them for other versions of R.

Installing R packages using conda

Note: Conda packages are incompatible with the software modules. Usage of conda is discouraged in the clusters at UAntwerpen, UGent, and VUB.

The easiest way to install and manage your own R environment is conda.

Installing Miniconda

If you have Miniconda already installed, you can skip ahead to the next section; if Miniconda is not installed, we start with that. Download the Bash script that will install it from conda.io using, e.g., wget:

$ wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh

Once downloaded, run the installation script:

$ bash Miniconda3-latest-Linux-x86_64.sh -b -p $VSC_DATA/miniconda3

Optionally, you can add the path to the Miniconda installation to the PATH environment variable in your .bashrc file. This is convenient, but may lead to conflicts when working with the module system, so make sure that you know what you are doing in either case. The line to add to your .bashrc file would be:

export PATH="${VSC_DATA}/miniconda3/bin:${PATH}"

Creating an environment

First, ensure that the Miniconda installation is in your PATH environment variable. The following command should return the full path to the conda command:

$ which conda

If the result is blank, or reports that conda cannot be found, modify the PATH environment variable appropriately by adding Miniconda's bin directory to PATH. Creating a new conda environment is straightforward:

$ conda create -n science -c r r-essentials r-rodbc


This command creates a new conda environment called science, and installs a number of R packages that you will probably want to have handy in any case to preprocess, visualize, or postprocess your data. You can of course install more, depending on your requirements and personal taste.

Working with the environment

To work with an environment, you have to activate it. This is done with, e.g.,

$ source activate science

Here, science is the name of the environment you want to work in.

Install an additional package

To install an additional package, e.g., r-ggplot2, first ensure that the environment you want to work in is activated.

$ source activate science

Next, install the package:

$ conda install -c r r-ggplot2

Note that conda will take care of all dependencies, including non-R libraries. This ensures that you work in a consistent environment.

Updating/removing

Using conda, it is easy to keep your packages up-to-date. Updating a single package (and its dependencies) can be done using:

$ conda update r-rodbc

Updating all packages in the environment is trivial:

$ conda update --all

Removing an installed package:

$ conda remove r-mass

Deactivating an environment

To deactivate a conda environment, i.e., return the shell to its original state, use the following command:

$ source deactivate

More information

Additional information about conda can be found on its documentation site.


5.4 Integrating code with software packages

5.4.1 R integrating C functions

Purpose

Although R is a nice and fairly complete software package for statistical analysis, there are nevertheless situations where it is desirable to extend R. This may be either to add functionality that is implemented in some C library, or to eliminate performance bottlenecks in R code. In this how-to it is assumed that the user wants to call their own C functions from R.

Prerequisites

It is assumed that the reader is familiar with the use of R as well as R scripting, and is a reasonably proficient C programmer. Specifically the reader should be familiar with the use of pointers in C.

Integration step by step

Before all else, first load the appropriate R module to prepare your environment, e.g.,

$ module load R

If you want a specific version of R, you can first check which versions are available using

$ module av R

and then load the appropriate version of the module, e.g.,

$ module load R/3.1.1-intel-2014b

A first example

No tutorial is complete without the mandatory ‘hello world’ example. The C code in file ‘myRLib.c’ is shown below:

#include <R.h>

void sayHello(int *n) {
    int i;
    for (i = 0; i < *n; i++)
        Rprintf("hello world!\n");
}

Three things should be noted at this point:
1. the 'R.h' header file has to be included; this file is part of the R distribution, and R knows where to find it;
2. function parameters are always pointers; and
3. to print to the R console, 'Rprintf' rather than 'printf' should be used.
From this 'myRLib.c' file a shared library can be built in one convenient step:

$ R CMD SHLIB myRLib.c


If all goes well, i.e., if the source code has no syntax errors and all functions have been defined, this command will produce a shared library called 'myRLib.so'. To use this function from within R in a convenient way, a simple R wrapper can be defined in 'myRLib.R':

dyn.load("myRLib.so");
sayHello <- function(n) {
    .C("sayHello", as.integer(n))
}

In this script, the first line loads the shared library containing the 'sayHello' function. The second line defines a convenient wrapper to simplify calling the C function from R. The C function is called using the '.C' function. The latter's first parameter is the name of the C function to be called, i.e., 'sayHello'; all other parameters will be passed to the C function, i.e., the number of times that 'sayHello' will say hello, as an integer. Now, R can be started to be used interactively as usual, i.e.,

$ R

In R, we first source the library’s definitions in ‘myRLib.R’, so that the wrapper functions can be used:

> source("myRLib.R")
> sayHello(2)
hello world!
hello world!
[[1]]
[1] 2

Note that the ‘sayHello’ function is not particularly interesting since it does not return any value. The next example will illustrate how to accomplish this.

A second, more engaging example

Given R’s pervasive use of vectors, a simple example of a function that takes a vector of real numbers as input, and returns its components’ sum as output is shown next.

#include <R.h>

/* sayHello part not shown */

void mySum(double *a, int *n, double *s) {
    int i;
    *s = 0.0;
    for (i = 0; i < *n; i++)
        *s += a[i];
}

Note that both 'a' and 's' are declared as pointers, the former being used as the address of the first array element, the latter as an address to store a double value, i.e., the sum of the array's components. To produce the shared library, it is built using the appropriate R command as before:

$ R CMD SHLIB myRLib.c

The wrapper code for this function is slightly more interesting since it will be programmed to provide a convenient "function-feel”.


dyn.load("myRLib.so");

# sayHello wrapper not shown

mySum <- function(a) {
    n <- length(a);
    result <- .C("mySum", as.double(a), as.integer(n), s = double(1));
    result$s
}

Note that the wrapper function is now used to do some more work:
1. it preprocesses the input by calculating the length of the input vector;
2. it initializes 's', the parameter that will be used in the C function to store the result in; and
3. it captures the result from the call to the C function, which contains all parameters passed to the function, in the last statement only extracting the actual result of the computation.
From R, 'mySum' can now easily be called:

> source("myRLib.R")
> mySum(c(1,3,8))
[1] 12

Note that ‘mySum’ will probably not be faster than R’s own ‘sum’ function.

A last example

Functions can return vectors as well, so this last example illustrates how to accomplish this. The library is extended to:

#include <R.h>

/* sayHello and mySum not shown */

void myMult(double *a, int *n, double *lambda, double *b) {
    int i;
    for (i = 0; i < *n; i++)
        b[i] = (*lambda)*a[i];
}

The semantics of the function is simply to take a vector and a real number as input, and return a vector of which each component is the product of the corresponding component in the original vector with that real number. After building the shared library as before, we can extend the wrapper script for this new function as follows:

dyn.load("myRLib.so");

# sayHello and mySum wrapper not shown

myMult <- function(a, lambda) {
    n <- length(a);
    result <- .C("myMult", as.double(a), as.integer(n), as.double(lambda), m = double(n));
    result$m
}

From within R, ‘myMult’ can be used as expected.


> source("myRLib.R")
> myMult(c(1,3,8), 9)
[1]  9 27 72
> mySum(myMult(c(1,3,8), 9))
[1] 108

Further reading

Obviously, this text is just for the impatient. More in-depth documentation can be found on the nearest CRAN site.


CHAPTER 6

VSC hardware

6.1 Tier-2 hardware

6.1.1 UAntwerpen Tier-2 hardware

Vaughan hardware

Vaughan was installed in the summer of 2020. It is a NEC system consisting of 152 nodes with two 32-core AMD Epyc 7452 Rome generation CPUs connected through a HDR100 InfiniBand network. All nodes have 256 GB RAM. The nodes do not have a sizeable local disk.

Access restrictions

Access is available for faculty, students (master's projects under faculty supervision), and researchers of the AUHA. The cluster is integrated in the VSC network and runs to a large extent the standard VSC software setup. With Vaughan we switch away from the Torque/Moab combo to the Slurm Workload Manager, and we use native Slurm job scripts. Users are required to take the "transition to Vaughan and Slurm" course when the setup of Vaughan is ready. Vaughan is also available to all VSC users, though we appreciate that you contact the UAntwerp support team so that we know why you want to use the cluster. Jobs can have a maximal execution wall time of 3 days (72 hours). Vaughan should only be used if you have parallel jobs that are large enough to fill up all cores of a compute node, or can otherwise do so. Other work should be done on Leibniz. The login nodes are freely available. Access to the job queues and compute nodes is currently restricted. Contact UAntwerp user support ([email protected]) if you are interested in being a test user. One requirement is that you have jobs large enough to fill a compute node.

Hardware details

• 152 compute nodes


  – 2 AMD Epyc 7452 CPUs @ 2.35 GHz (Rome), 32 cores each
  – 256 GB RAM
  – 240 GB SSD local disk (for OS, should not be used as main scratch)
• 2 login nodes
  – 2 AMD Epyc 7282 CPUs @ 2.8 GHz (Rome), 16 cores each
  – 256 GB RAM
  – 2x 480 GB SSD local disk
The nodes are connected using an InfiniBand HDR100 network. They are logically organised in 3 islands: 2 islands have 36 nodes and one has 32 nodes. Storage is provided through the central UAntwerp storage system. More info on the storage system is available on the UAntwerpen storage page.

Login infrastructure

Direct login is possible to both login nodes.
• From outside the VSC network: use the external interface names. Outside of Belgium, a VPN connection to the UAntwerp network is required.
• From inside the VSC network (e.g., another VSC cluster): use the internal interface names.

                External interface                  Internal interface
Login generic   login-vaughan.hpc.uantwerpen.be
Login           login1-vaughan.hpc.uantwerpen.be    login1.vaughan.antwerpen.vsc
                                                    ln1.vaughan.antwerpen.vsc
                login2-vaughan.hpc.uantwerpen.be    login2.vaughan.antwerpen.vsc
                                                    ln2.vaughan.antwerpen.vsc

Available resources

Characteristics of the compute nodes

Since Vaughan is currently a homogeneous system with respect to CPU type, memory and interconnect, there is no need to specify any node properties.

Vaughan runs the Slurm Workload Manager as resource manager and scheduler. We do not support the PBS compatibility layer, but encourage users to develop proper Slurm job scripts, so that they can fully exploit the Slurm features and enjoy the power of the srun command when starting processes; a minimal example job script is given below. Make sure to read the following pages, which give a lot of information on Slurm and on how to convert your Torque scripts:
• Local Slurm documentation
• Important differences between Slurm and Torque
• Converting PBS/Torque options to Slurm
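As an illustration, a minimal Slurm job script for Vaughan could look as follows. This is only a sketch: the module name and program name are placeholders, and the resource requests should be adapted to your own application.

#!/bin/bash
#SBATCH --job-name=example          # job name shown by squeue
#SBATCH --nodes=1                   # Vaughan is meant for jobs that fill complete nodes
#SBATCH --ntasks-per-node=64        # one task per core of a 64-core node
#SBATCH --time=12:00:00             # requested wall time (at most 3 days on Vaughan)

module load calcua/2020a            # placeholder: load the software stack and modules you need
srun ./my_parallel_program          # srun starts the parallel processes

The script is then submitted with sbatch, e.g., sbatch job.slurm.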


Available partitions

When submitting a job with sbatch or using srun, you can choose to specify the partition your job is submitted to. This indicates the type of your job and imposes some restrictions, but may let your job start sooner. When the option is omitted, your job is submitted to the default partition (vaughan). The following partitions are available:

partition   limits
vaughan     Default. Maximum wall time of 3 days.
debug       Maximum 2 nodes, with a maximum wall time of 1 hour.
short       Maximum wall time of 6 hours, with priority boost.
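For example, a short two-node test could be submitted to the debug partition as follows (a sketch; job.slurm is a placeholder for your own job script):

$ sbatch --partition=debug --nodes=2 --time=01:00:00 job.slurm

The same effect can be obtained by adding #SBATCH --partition=debug to the job script itself.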

Compiling for Vaughan

To compile code for Vaughan, all intel, foss and GCC modules can be used (the latter equivalent to foss but without MPI and the math libraries).

Optimization options for the Intel compilers

As the processors in Vaughan are made by AMD, there is no explicit support in the Intel compilers. However, by choosing the appropriate compiler options, the Intel compilers still produce very good code for Vaughan that will often beat code produced by GCC (certainly for Fortran codes, as gfortran is a rather weak compiler).

To optimize specifically for Vaughan, compile on one of the Vaughan login or compute nodes and combine the option -march=core-avx2 with either optimization level -O2 or -O3. For some codes, the additional optimizations at level -O3 actually produce slower code (often the case if the code contains many short loops).

Note that if you forget these options, the default for the Intel compilers is to generate code at optimization level -O2 (which is pretty good) but for the Pentium 4 (-march=pentium4), which uses none of the new instructions and hence also none of the vector instructions introduced since 2005, which is pretty bad. Hence always specify -march=core-avx2 (or any of the equivalent architecture options, for specialists) when compiling code.

The -x and -ax-based options don't function properly on AMD processors. These options add CPU detection to the code, and whenever they detect an AMD processor, the binaries refuse to work or switch to code for the ancient Pentium 4 architecture. E.g., -xCORE-AVX2 is known to produce non-working code.
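In practice, compiling a C or Fortran source file with the Intel compilers on Vaughan could thus look like the following (a sketch; the toolchain version and file names are placeholders):

$ module load intel/2020a                       # placeholder toolchain version
$ icc   -O2 -march=core-avx2 -o myprog myprog.c
$ ifort -O2 -march=core-avx2 -o myprog myprog.f90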

Optimization options for the GNU compilers

We suggest using the newest GNU compilers available on Vaughan (preferably version 9 or newer), as the support for AMD processors has improved a lot recently. Never use the default GNU compilers installed on the system, but always load one of the foss or GCC modules.

To optimize for Vaughan, compile on one of the Vaughan login or compute nodes and combine either -march=native or -march=znver2 with either optimization level -O2 or -O3. In most cases, and especially for floating point intensive code, -O3 will be the preferred optimization level with the GNU compilers, as they only activate vectorization at this level, whereas the Intel compilers already offer vectorization at level -O2.

If you really need to use a GCC version prior to version 8, -march=znver2 is not yet available; on GCC 6 or 7, -march=znver1 is probably the best choice. However, avoid using GCC versions that are even older.

Note that if you forget these options, the default for the GNU compilers is to generate unoptimized (level -O0) code for a very generic CPU (-march=x86-64), which doesn't exploit the performance potential of the Vaughan CPUs at all. Hence one should always specify an appropriate architecture (the -march flag) and an appropriate optimization level (the -O flag) as explained in the previous paragraphs.
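In practice, compiling with the GNU compilers on Vaughan could thus look like the following (a sketch; the toolchain version and file names are placeholders):

$ module load foss/2020a                        # placeholder toolchain version
$ gcc      -O3 -march=znver2 -o myprog myprog.c
$ gfortran -O3 -march=znver2 -o myprog myprog.f90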

Further documentation:

• Intel toolchains
• FOSS toolchains (contains GCC)

Origin of the name

Vaughan is named after Dorothy Vaughan, an African-American mathematician who worked for NACA and NASA. During her 28-year career, Vaughan prepared for the introduction of machine computers in the early 1960s by teaching herself and her staff the programming language Fortran. She later headed the programming section of the Analysis and Computation Division (ACD) at Langley.

Leibniz hardware

Leibniz was installed in the spring of 2017. It is a NEC system consisting of 152 nodes with 2 14-core Intel E5-2680v4 Broadwell generation CPUs, connected through an EDR InfiniBand network. 144 of these nodes have 128 GB RAM, the other 8 have 256 GB RAM. The nodes do not have a sizeable local disk. The cluster also contains a node for visualisation and 3 node types for experimenting with accelerators: 2 nodes for GPU computing (NVIDIA Pascal generation), one node with dual NEC SX-Aurora TSUBASA vector processors and one node with an Intel Xeon Phi expansion board.

Access restrictions

Access is available for faculty, students (master's projects under faculty supervision), and researchers of the AUHA. The cluster is integrated in the VSC network and runs the standard VSC software setup. It is also available to all VSC users, though we appreciate it if you contact the UAntwerpen support team so that we know why you want to use the cluster.

Jobs can have a maximal execution wall time of 3 days (72 hours), except on the "hopper" compute nodes of Leibniz, where it is possible to submit 7-day jobs on request (motivation needed). Please also consider using the newer cluster Vaughan for big parallel jobs that can use 64 cores or multiples thereof, as soon as that cluster becomes available.

The login nodes and regular compute nodes are freely available. Contact UAntwerp user support (hpc@uantwerpen.be) for access to the visualization node and the accelerator nodes (free of charge, but controlled access).

Hardware details

• 152 regular compute nodes
  – 2 Xeon E5-2680v4 CPUs@2.4 GHz (Broadwell), 14 cores each
  – 128 GB RAM (144 nodes) or 256 GB RAM (8 nodes)
  – 120 GB SSD local disk
• 24 "hopper" compute nodes (recovered from the former Hopper cluster)
  – 2 Xeon E5-2680v2 CPUs@2.8 GHz (Ivy Bridge), 10 cores each
  – 64 GB RAM (144 nodes) or 256 GB RAM (24 nodes)
  – 500 GB local disk
  – Instructions for the hopper compute nodes
• 2 GPGPU nodes
  – 2 Xeon E5-2680v4 CPUs@2.4 GHz (Broadwell), 14 cores each
  – 128 GB RAM
  – 2 NVIDIA P100, 16 GB HBM2
  – 120 GB SSD local disk
  – Instructions for the GPGPU nodes
• 1 vector computing node (NEC SX-Aurora TSUBASA model A300-2)
  – 1 Xeon Gold 6126 CPU@2.6 GHz (Skylake) with 12 cores
  – 96 GB RAM
  – 2 NEC SX-Aurora Vector Engines type 10B (per card 8 cores @1.4 GHz, 48 GB HBM2)
  – 240 GB SSD local disk
  – Instructions for the NEC SX-Aurora node
• 2 login nodes
  – 2 Xeon E5-2680v4 CPUs@2.4 GHz (Broadwell), 14 cores each
  – 256 GB RAM
  – 120 GB SSD local disk
• 1 visualization node
  – 2 Xeon E5-2680v4 CPUs@2.4 GHz (Broadwell), 14 cores each
  – 256 GB RAM
  – 1 NVIDIA Quadro P5000
  – 120 GB SSD local disk
  – Instructions for the visualization node

The nodes are connected using an InfiniBand EDR network, except for the "hopper" compute nodes, which use FDR10 InfiniBand. More info on the storage system is available on the UAntwerpen storage page.

Login infrastructure

Direct login is possible to both login nodes and to the visualization node.
• From outside the VSC network: use the external interface names. Outside of Belgium, a VPN connection to the UAntwerp network is required.
• From inside the VSC network (e.g., another VSC cluster): use the internal interface names.


                    External interface                  Internal interface
Login generic       login-leibniz.hpc.uantwerpen.be
Login               login1-leibniz.hpc.uantwerpen.be    ln1.leibniz.antwerpen.vsc
                                                        login1.leibniz.antwerpen.vsc
                    login2-leibniz.hpc.uantwerpen.be    ln2.leibniz.antwerpen.vsc
                                                        login2.leibniz.antwerpen.vsc
Visualisation node  viz1-leibniz.hpc.uantwerpen.be      viz1.leibniz.antwerpen.vsc

Characteristics of the compute nodes

To remain compatible with the typical VSC setup, a number of properties can be used in job scripts. However, only one of them is really useful for selecting the proper node type in the current setup of Leibniz: mem256.

property   explanation
broadwell  Only use Intel processors from the Broadwell family (E5-XXXXv4). Not needed at the moment, as this CPU type is selected automatically.
ivybridge  Only use Intel processors from the Ivy Bridge family (E5-XXXXv2). Not needed at the moment, as there is no automatic selection of the queue for the Ivy Bridge nodes; specify -q hopper instead.
gpu        Only use the GPGPU nodes of Leibniz. Not needed at the moment, as there is no automatic selection of the queue for the GPGPU nodes; specify -q gpu instead.
ib         Use the InfiniBand interconnect. Not needed at the moment, as all nodes are connected to the InfiniBand interconnect.
mem128     Use nodes with 128 GB RAM (roughly 112 GB available). This is the majority of the nodes on Leibniz. Requesting this as a feature ensures that you get a node with 128 GB of memory and keeps the nodes with more memory available for other users who really need that feature.
mem256     Use nodes with 256 GB RAM (roughly 240 GB available). This property is useful if you submit a batch of jobs that require more than 4 GB of RAM per processor but do not use all cores, and you do not want to use a tool to bundle jobs yourself, such as Worker, as it helps the scheduler to put those jobs on nodes that can be further filled with your jobs.
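For example, a job that needs a full node with 256 GB of RAM could request the mem256 property as follows (a sketch; the core count, wall time and script name are placeholders):

$ qsub -l nodes=1:ppn=28:mem256 -l walltime=24:00:00 job.pbs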

Compiling for Leibniz

To compile code for Leibniz, all intel, foss and GCC modules can be used (the latter equivalent to foss but without MPI and the math libraries).

Optimization options for the Intel compilers

To optimize specifically for Leibniz, compile on one of the Leibniz login or compute nodes and combine the option -xHost with either optimization level -O2 or -O3. For some codes, the additional optimizations at level -O3 actually produce slower code (often the case if the code contains many short loops). Note that if you forget these options, the default for the Intel compilers is to generate code at optimization level -O2 (which is pretty good) but for the Pentium 4 (-march=pentium4) which uses none of the new instructions and hence also none of the vector instructions introduced since 2005, which is pretty bad. Hence always specify -xHost (or any of the equivalent architecture options specifically for Broadwell for specialists) when compiling code.
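In practice, compiling with the Intel compilers on Leibniz could thus look like the following (a sketch; the toolchain version and file names are placeholders):

$ module load intel/2020a                  # placeholder toolchain version
$ icc   -O2 -xHost -o myprog myprog.c
$ ifort -O2 -xHost -o myprog myprog.f90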

Optimization options for the GNU compilers

Never use the default GNU compilers installed on the system, but always load one of the foss or GCC modules.


To optimize for Leibniz, compile on one of the Leibniz login or compute nodes and combine either -march=native or -march=broadwell with either optimization level -O2 or -O3. In most cases, and especially for floating point intensive code, -O3 will be the preferred optimization level with the GNU compilers, as they only activate vectorization at this level, whereas the Intel compilers already offer vectorization at level -O2.

Note that if you forget these options, the default for the GNU compilers is to generate unoptimized (level -O0) code for a very generic CPU (-march=x86-64), which doesn't exploit the performance potential of the Leibniz CPUs at all. Hence one should always specify an appropriate architecture (the -march flag) and an appropriate optimization level (the -O flag) as explained in the previous paragraph.
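In practice, compiling with the GNU compilers on Leibniz could thus look like the following (a sketch; the toolchain version and file names are placeholders):

$ module load foss/2020a                   # placeholder toolchain version
$ gcc      -O3 -march=broadwell -o myprog myprog.c
$ gfortran -O3 -march=broadwell -o myprog myprog.f90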

Further documentation:

• Intel toolchains
• FOSS toolchains (contains GCC)

Origin of the name

Leibniz is named after Gottfried Wilhelm Leibniz, a German multi-disciplinary scientist living in the late 17th and early 18th century. Leibniz may be best known as a developer of differential and integral calculus, independently of the work of Isaac Newton. But his contributions to science do not stop there. Leibniz also refined the binary number system, the foundation of nearly all modern computers. He also designed mechanical calculators on which one could do the four basic operations (add, subtract, multiply and divide). In all, Leibniz made contributions to philosophy, mathematics, physics and technology, and several other fields.

UAntwerpen storage

The storage is organised according to the VSC storage guidelines.

File systems

Variable            Type     Access            Backup  Default quota        Total capacity
$VSC_HOME           NFS/XFS  VSC               YES     3 GB, 20000 files    3.5 TB
$VSC_DATA           NFS/XFS  VSC               YES     25 GB, 100000 files  60 TB
$VSC_SCRATCH
$VSC_SCRATCH_SITE   BeeGFS   leibniz, vaughan  NO      50 GB                0.5 PB
$VSC_SCRATCH_NODE   ext4     Node              NO      100 GB leibniz,
                                                       100 GB vaughan

• The home file system uses two mirrored SSD drives. It should only be used for the configuration files that Linux programs use. It should not be used to install software or to run jobs from. It is also not meant to be heavily written to, but offers excellent read performance. (Incremental) backups are made on a daily basis.
• The data file system uses 4 TB hard drives in a RAID6 configuration for redundancy in case of a disk failure. It uses the XFS file system, exported to the nodes of the cluster over NFS. We make backups of data on this file system for as long as the daily volume remains small enough. The data file system should be used for:
  – Data that should be stored for a longer time since it is often reused on the cluster.
  – It is the best place to install software you prefer to install yourself. For performance reasons we do advise users to build on top of existing centrally installed software and not do a complete install of, e.g., Python or R if it can be built on the existing installation.
  – The data file system should not be used to store temporary data or to run jobs on. We may however ask users who tend to generate many small files to use this file system instead of the scratch file system; their directories might not be included in the backup though.
• The central scratch file system is a parallel file system using BeeGFS. It has by far the largest capacity of all file systems at UAntwerpen and also provides much higher bandwidth than the data and home file systems. BeeGFS is a parallel supercomputer file system, which implies that it is optimized for large I/O transfers to large files. Note that BeeGFS does not support hard links. Hard links are not used a lot, but Conda uses them internally, so installing software using conda on the scratch file system will not work.

Note that the VSC storage is not meant for backup purposes. You should not rely on it to store all data of your PhD or postdoc. It is the individual user's responsibility to store that data on reliable storage managed by your research group or the ICT service of the university.

Environment variables

Variable            Name
$VSC_HOME           /user/antwerpen/20X/vsc20XYZ
$VSC_DATA           /data/antwerpen/20X/vsc20XYZ
$VSC_SCRATCH
$VSC_SCRATCH_SITE   /scratch/antwerpen/20X/vsc20XYZ
$VSC_SCRATCH_NODE   /tmp

For users with non-vsc2XXXX-accounts, the path names will be similar to those above for UAntwerpen users with trivial modifications based on your home institution and VSC account number.
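As an illustration, a job script can use these variables to stage data between the file systems (a sketch; the directory and file names are hypothetical):

# copy the input data from the data file system to the scratch file system
mkdir -p $VSC_SCRATCH/myproject
cp $VSC_DATA/myproject/input.dat $VSC_SCRATCH/myproject/

# run the calculation on the scratch file system
cd $VSC_SCRATCH/myproject
./my_program input.dat > output.dat

# copy the results back to the data file system for longer-term storage
cp output.dat $VSC_DATA/myproject/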

Requesting quota

• Users with a vsc2XXXX account: additional quota can be requested via the UAntwerp support team at hpc@uantwerpen.be.
• Users with another VSC account:
  – Your scratch directory is not automatically created on the UAntwerp clusters. Please contact the UAntwerp support team at hpc@uantwerpen.be to get a scratch directory on our systems if you plan to use them. You can use the same mail address to request additional quota on the scratch file system.
  – Your user and data directories are exported from your home institution, which implies that you have to ask for additional quota on those file systems at your home institution.

Note

On the previous storage system, UAntwerpen had an additional volume, /small, that was slightly better optimized for storing lots of small files. There is no full equivalent in the current storage system. If the small files are the result of installing software yourself, $VSC_DATA is the ideal place to store them. If they are the result of your jobs, we encourage you to consider switching to better software or reworking your code to produce a more supercomputer-friendly output pattern (or to use databases such as SQLite3); if that is impossible, contact UAntwerpen user support at hpc@uantwerpen.be to discuss a possible solution and the best place to put your data.


UAntwerp-specific software instructions

Slurm Workload Manager

Vaughan runs Slurm Workload Manager as the resource manager and scheduler; Leibniz will be switched over at a later date. The Slurm documentation is work-in-progress as we are still refining the setup.
• An overview of Slurm concepts and commands:
  – The Slurm Workload Manager @ UAntwerp page contains the minimum that we expect a user to know.
  – The Slurm @ UAntwerp: Advanced use page contains some more advanced commands and additional ways to request resources. Check those if you feel the basic information is not enough.
• You can also have a look at our quick PBS-to-Slurm conversion tables, which will help you convert your PBS job scripts to Slurm.
• A list of important differences between Torque and Slurm

Software installed on the UAntwerp clusters

• The Intel toolchain is set up in a slightly different way from other VSC sites since the 2017a version.
• The 2020a toolchain family is our base toolchain for Vaughan. Software in older toolchains will not be made available on Vaughan unless experience shows a serious performance degradation when compiled with the newer toolchain.
• Overview of licensed software at UAntwerp and instructions on how to get access.
• Python on the UAntwerp clusters

Instructions for the special node types

• Using remote visualisation: VNC-based remote visualisation is fully supported on the visualisation node and partly supported on the regular login nodes.
• Working on the GPU nodes
• Vector computing on the NEC SX-Aurora TSUBASA node
• Using the hopper nodes in Leibniz

6.1.2 VUB Tier-2 hardware

Hydra hardware

The VUB Hydra cluster contains a mix of nodes with Intel processors of different CPU microarchitectures, and different interconnects in different sections of the cluster. The cluster also contains a number of nodes with NVIDIA GPUs.


Hardware details

nodes  processor                                  memory  disk    network  extra
11     2x 10-core INTEL E5-2680v2 (ivybridge)     128 GB  900 GB  QDR-IB
20     2x 10-core INTEL E5-2680v2 (ivybridge)     256 GB  900 GB  QDR-IB
6      2x 10-core INTEL E5-2680v2 (ivybridge)     128 GB  900 GB  QDR-IB   2x Tesla K20Xm NVIDIA GPGPUs, 6 GB/node (kepler)
27     2x 14-core INTEL E5-2680v4 (broadwell)     256 GB  1 TB    10 Gbps
1      4x 10-core INTEL E7-8891v4 (broadwell)     1.5 TB  4 TB    10 Gbps
4      2x 12-core INTEL E5-2650v4 (broadwell)     256 GB  2 TB    10 Gbps  2x Tesla P100 NVIDIA GPGPUs, 16 GB/node (pascal)
1      2x 16-core INTEL E5-2683v4 (broadwell)     512 GB  8 TB    10 Gbps  4x GeForce 1080Ti NVIDIA GPUs, 12 GB/node (geforce)
22     2x 20-core INTEL Xeon Gold 6148 (skylake)  192 GB  1 TB    10 Gbps
31     2x 20-core INTEL Xeon Gold 6148 (skylake)  192 GB  1 TB    EDR-IB

Access restrictions

Access is available for faculty members, students (master’s projects under faculty supervision), and researchers of the VUB, as well as VSC users of other Flemish universities. The cluster is integrated in the VSC network and runs the standard VSC software setup. Jobs can have a maximal execution wall time of 5 days (120 hours).

Login infrastructure

Users with a VSC account (VSC-ID) can connect to Hydra via the following hostname:
• <VSC-ID>@login.hpc.vub.be (an example ssh command is given below)

Hardware specs:
• Intel Skylake (Xeon Gold 6126) - 24 cores in total (fair share between all users)
• 96 GB memory (maximum per user: 12 GB)
• 10GbE network connection
• InfiniBand EDR connection to the storage
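For example, from a terminal on Linux or macOS (a sketch; vsc10xyz is a placeholder for your own VSC-ID):

$ ssh vsc10xyz@login.hpc.vub.be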

User documentation

For documentation on Hydra usage, consult the documentation website: https://hpc.vub.be/docs/


For questions or problems, contact the VUB HPC team: hpc@vub.be

VUB storage

The storage is organized according to the VSC storage guidelines

Variable            Type  Access  Backup  Default quota
$VSC_HOME           GPFS  VSC     YES     6 GB
$VSC_DATA           GPFS  VSC     YES     50 GB
$VSC_SCRATCH
$VSC_SCRATCH_SITE   GPFS  Hydra   NO      100 GB
$VSC_SCRATCH_NODE   ext4  Node    NO      (no quota)

For users from other universities, the quota on $VSC_HOME and $VSC_DATA are determined by the local policy of your home institution, as these file systems are mounted from there. The path names are similar, with trivial modifications based on your home institution and VSC account.

Variable            Name
$VSC_HOME           /user/brussel/10X/vsc10XYZ
$VSC_DATA           /data/brussel/10X/vsc10XYZ
$VSC_SCRATCH
$VSC_SCRATCH_SITE   /scratch/brussel/10X/vsc10XYZ
$VSC_SCRATCH_NODE   /tmp/vsc10XYZ

6.1.3 HPC-UGent Tier-2 hardware

The Stevin computing infrastructure consists of several Tier2 clusters which are hosted in the S10 datacenter of Ghent University. This infrastructure is co-financed by FWO and Department of Economy, Science and Innovation (EWI).

Login nodes

Log in to the HPC-UGent infrastructure using SSH via login.hpc.ugent.be.


Compute clusters

• swalot: 128 nodes, 2x 10-core Intel E5-2660v3 (Haswell-EP @ 2.6 GHz), 116 GiB usable memory/node, 1 TB local diskspace/node, FDR InfiniBand, CentOS 7
• skitty: 72 nodes, 2x 18-core Intel Xeon Gold 6140 (Skylake @ 2.3 GHz), 177 GiB usable memory/node, 1 TB + 240 GB SSD local diskspace/node, EDR InfiniBand, CentOS 7
• victini*: 96 nodes, 2x 18-core Intel Xeon Gold 6140 (Skylake @ 2.3 GHz), 88 GiB usable memory/node, 1 TB + 240 GB SSD local diskspace/node, 10 GbE, CentOS 7
• joltik: 10 nodes, 2x 16-core Intel Xeon Gold 6242 (Cascade Lake @ 2.8 GHz) + 4x NVIDIA Volta V100 GPUs (32 GB GPU memory), 256 GiB usable memory/node, 800 GB SSD local diskspace/node, double EDR InfiniBand, CentOS 7
• kirlia: 16 nodes, 2x 18-core Intel Xeon Gold 6240 (Skylake @ 2.6 GHz), 738 GiB usable memory/node, 1.6 TB NVMe local diskspace/node, HDR-100 InfiniBand, CentOS 7
• doduo: 128 nodes, 2x 48-core AMD EPYC 7552 (Rome @ 2.2 GHz), 250 GiB usable memory/node, 180 GB SSD local diskspace/node, HDR-100 InfiniBand, RHEL 8

(*) default cluster

For the most recent information about the available resources and cluster status, please consult https://www.ugent.be/hpc/en/infrastructure.

Shared storage

Filesystem name         Intended usage                                               Total storage space  Personal storage space  VO storage space (*)
$VSC_HOME               Home directory, entry point to the system                    51 TB                3 GB (fixed)            (none)
$VSC_DATA               Long-term storage of large data files                        1.8 PB               25 GB (fixed)           250 GB
$VSC_SCRATCH            Temporary fast storage of 'live' data for calculations       1.9 PB               25 GB (fixed)           250 GB
$VSC_SCRATCH_ARCANINE   Temporary very fast storage of 'live' data for calculations  70 TB NVMe           (none)                  upon request
                        (recommended for very I/O-intensive jobs)

(*) Storage space for a group of users (Virtual Organisation or VO for short) can be increased significantly on request. For more information, see our HPC-UGent tutorial: https://www.ugent.be/hpc/en/support/documentation.htm.

User documentation

Please consult https://www.ugent.be/hpc/en/support/documentation.htm.


In case of questions or problems, don't hesitate to contact the HPC-UGent support team via hpc@ugent.be; see also https://www.ugent.be/hpc/en/support.

6.1.4 KU Leuven/UHasselt Tier-2 hardware

Genius hardware

Genius is KU Leuven/UHasselt’s most recent Tier-2 cluster. It has thin nodes, large memory nodes, as well as GPGPU nodes.

Login infrastructure

Direct login using SSH is possible to all login infrastructure without restrictions.

You can access Genius through: login-genius.hpc.kuleuven.be

This will load balance your connection to one of the 4 Genius login nodes. Two types of login nodes are available:
• classic login nodes, i.e., terminal SSH access:
  – login1-tier2.hpc.kuleuven.be
  – login2-tier2.hpc.kuleuven.be
  – login3-tier2.hpc.kuleuven.be
  – login4-tier2.hpc.kuleuven.be
• a login node that provides a desktop environment that can be used for, e.g., visualization, see the NX clients section:
  – nx.hpc.kuleuven.be

Warning: This node should not be accessed using terminal SSH, it serves only as a gateway to the actual login nodes your NX sessions will be running on.

The NX login node will start a session on a login node that has a GPU, i.e., either

* login3-tier2.hpc.kuleuven.be
* login4-tier2.hpc.kuleuven.be

Hardware details

• 230 thin nodes
  – 86 skylake nodes
    * 2 Xeon Gold 6140 CPUs@2.3 GHz (Skylake), 18 cores each
    * 192 GB RAM (memory bandwidth and latency measurements)
    * 200 GB SSD local disk
    * feature skylake
  – 144 cascadelake nodes
    * 2 Xeon Gold 6240 CPUs@2.6 GHz (Cascadelake), 18 cores each
    * 192 GB RAM (memory bandwidth and latency measurements)
    * 200 GB SSD local disk
    * feature cascadelake
• 10 big memory nodes
  – 2 Xeon Gold 6140 CPUs@2.3 GHz (Skylake), 18 cores each
  – 768 GB RAM
  – 200 GB SSD local disk
  – partition bigmem, specific qsub options apply
• 22 GPGPU nodes, 96 GPU devices
  – 20 P100 nodes
    * 2 Xeon Gold 6140 CPUs@2.3 GHz (Skylake), 18 cores each
    * 192 GB RAM
    * 4 NVIDIA P100 SXM2, 16 GB GDDR, connected with NVLink
    * 200 GB SSD local disk
    * partition gpu, specific qsub options apply
  – 2 V100 nodes
    * 2 Xeon Gold 6240 CPUs@2.6 GHz (Cascadelake), 18 cores each
    * 768 GB RAM
    * 8 NVIDIA V100 SXM2, 32 GB GDDR, connected with NVLink
    * 200 GB SSD local disk
    * partition gpu, specific qsub options apply
• 4 AMD nodes
  – 2 EPYC 7501 CPUs@2.0 GHz, 32 cores each
  – 256 GB RAM
  – 200 GB SSD local disk
  – partition amd, specific qsub options apply

The nodes are connected using an InfiniBand EDR network (bandwidth 25 Gb/s); the islands are indicated on the diagram below.


Superdome hardware

Superdome is a shared memory machine and consists of several nodes.

Login infrastructure

Direct login using SSH is possible to all login infrastructure without restrictions.

You can access Genius through: login-genius.hpc.kuleuven.be

This will load balance your connection to one of the 4 Genius login nodes. Two types of login nodes are available:
• classic login nodes, i.e., terminal SSH access:
  – login1-tier2.hpc.kuleuven.be
  – login2-tier2.hpc.kuleuven.be
  – login3-tier2.hpc.kuleuven.be
  – login4-tier2.hpc.kuleuven.be
• a login node that provides a desktop environment that can be used for, e.g., visualization, see the NX clients section:
  – nx.hpc.kuleuven.be

Warning: This node should not be accessed using terminal SSH, it serves only as a gateway to the actual login nodes your NX sessions will be running on.

The NX login node will start a session on a login node that has a GPU, i.e., either

* login3-tier2.hpc.kuleuven.be
* login4-tier2.hpc.kuleuven.be

Hardware details

• 8 nodes
  – Xeon Gold 6132 CPUs@2.6 GHz (Skylake), 14 cores each
  – 750 GB RAM
  – partition superdome

In total Superdome thus has 6 TB of RAM and additionally a local shared scratch file system of 6 TB. A quick start guide is available to get you started on submitting jobs to the Superdome.

KU Leuven storage

The storage is organized according to the VSC storage guidelines

Variable            Type    Access             Backup  Default quota
$VSC_HOME           NFS     VSC                YES     3 GB
$VSC_DATA           NFS     VSC                YES     75 GB
$VSC_SCRATCH
$VSC_SCRATCH_SITE   Lustre  genius             NO      500 GB
$VSC_SCRATCH_NODE   ext4    genius, job only   NO      200 GB
$VSC_SCRATCH_JOB    BeeGFS  genius, job only   NO      variable

For users from other universities, the quota on $VSC_HOME and $VSC_DATA will be determined by the local policy of your home institution as these file systems are mounted from there. The path names will be similar with trivial modifications based on your home institution and VSC account number.

Variable            Name
$VSC_HOME           /user/leuven/30X/vsc30XYZ
$VSC_DATA           /data/leuven/30X/vsc30XYZ
$VSC_SCRATCH
$VSC_SCRATCH_SITE   /scratch/leuven/30X/vsc30XYZ
$VSC_SCRATCH_NODE   /local_scratch


The $VSC_HOME and $VSC_DATA file systems have snapshots, so it is possible to recover data that was accidentally deleted or modified by restoring a snapshot. On Genius, a BeeGFS file system can be created during the run of a job; see the page on using BeeGFS for details.

6.2 Tier-1 hardware

6.2.1 Breniac hardware

Breniac is the VSC’s Tier-1 cluster.

Breniac login infrastructure

Direct login using SSH is possible to all login infrastructure.

Note: Only users involved in an active Tier-1 project have access to the infrastructure.

Two types of login nodes are available:
• classic login nodes, i.e., terminal SSH access:
  – login1-tier1.hpc.kuleuven.be
  – login2-tier1.hpc.kuleuven.be
• a login node that provides a desktop environment that can be used for, e.g., visualization, see the NX clients section:
  – nx-tier1.hpc.kuleuven.be

Warning: This node should not be accessed using terminal SSH, it serves only as a gateway to the actual login nodes your NX sessions will be running on.

The NX login node will start a session on a login node that has a GPU, i.e., login2-tier1.hpc.kuleuven.be.

Hardware details

• 408 skylake nodes
  – 2 Xeon Gold 6132 CPUs@2.6 GHz, 14 cores each
  – 192 GB RAM (memory bandwidth and latency measurements)
  – 75 GB SSD local disk
  – specific qsub options apply
• 436 broadwell nodes
  – 2 Xeon E5-2680v4 CPUs@2.4 GHz, 14 cores each
  – 128 GB RAM (memory bandwidth and latency measurements)
  – 75 GB SSD local disk
  – specific qsub options apply
• 144 broadwell nodes
  – 2 Xeon E5-2680v4 CPUs@2.4 GHz, 14 cores each
  – 256 GB RAM
  – 75 GB SSD local disk
  – specific qsub options apply

The nodes are connected using an InfiniBand EDR network.

Breniac storage

Your $VSC_HOME and $VSC_DATA directory are mounted on the Breniac login and compute nodes. See your VSC institute’s information on local storage about policies and quota. Also check the VSC storage guidelines for information on what to store where.

Note: $VSC_HOME and $VSC_DATA are mounted using NFS, so they can not be used for parallel I/O. If your software benefits from using a parallel file system, please use $VSC_SCRATCH.

Variable            Type  Access             Backup  Default quota
$VSC_SCRATCH
$VSC_SCRATCH_SITE   GPFS  breniac            NO      1 TB
$VSC_SCRATCH_NODE   ext4  breniac, job only  NO      75 GB


The path names given below should be adapted to reflect your home institution and VSC account number.

Variable            Name
$VSC_HOME           /user/leuven/30X/vsc30XYZ
$VSC_DATA           /data/leuven/30X/vsc30XYZ
$VSC_SCRATCH
$VSC_SCRATCH_SITE   /scratch/leuven/30X/vsc30XYZ
$VSC_SCRATCH_NODE   /local_scratch


CHAPTER 7

Globus file and data sharing platform

7.1 What is Globus

The Globus platform enables developers to provide robust file transfer, sharing and search capabilities within their own research data applications and services, while leveraging advanced identity management, single sign-on, and authorization capabilities. This document is a hands-on guide to the Globus file sharing platform. It should complement the official documentation at docs.globus.org. For all questions concerning the Globus file sharing platform, please contact the VSC Globus team at [email protected]. We welcome your feedback, comments and suggestions for improving the Globus tutorial.

7.2 Access

Log in with your institution: Visit www.globus.org and click Login at the top of the page. On the Globus login page, choose an organization you are already registered with, such as your school or your employer.


When you find it, click Continue. If you cannot find your organization in the list please contact the support team at [email protected]. You will be redirected to your organization’s login page. Use your credentials for that organization to login. If that is your first time logging into Globus some organizations may ask for your permission to release your account information to Globus. In most cases that would be a one-time request. Once you have logged in with your organization, Globus will ask if you would like to link to an existing account. If this is your first time logging in to Globus, click Continue. If you have already used another account with Globus, you can choose Link to an existing account. You may be prompted to provide additional information such as your organization and whether or not Globus will be used for commercial purposes. Complete the form and click Continue. Finally, you need to give Globus permission to use your identity to access information and perform actions (like file transfers) on your behalf.


7.3 Managing and transferring files

7.3.1 The File Manager

After you have logged in to Globus, you will begin at the File Manager. The first time you use the File Manager, all fields will be blank.


7.3.2 Globus Collections and Endpoints

Within Globus user data resides in a location called Globus Collection. Globus Collections can be hosted on various storage and computer systems, e.g., HPC clusters, laptops, or cloud storage. To use a collection you only need its name. It is not necessary to know any details about the storage or system where that collection is being hosted. An Endpoint is a server that hosts collections. If you want to be able to access, share, transfer, or manage data using Globus, the first step is to create an endpoint on the system where the data is (or will be) stored. To access a Globus Collection: • Click in the Collection field at the top of the File Manager page and search all available collections and endpoints by typing a collection/endpoint name or a description. Globus will list collections with matching names.


Note: At the VSC all Tier-1 and Tier-2 systems have dedicated Globus Collections. Type vsc in the search field to see the list.

• Click on a collection. Globus will connect to the collection and display the default directory. Navigate either by typing the destination directory into the Path field, or browse the available directories below.

7.3.3 Transferring files

VSC users can use Globus to transfer files and data between collections they have access to. Those can be, e.g., your own VSC /scratch and /data directories on Tier-1 and Tier-2, or any other remote or local server collection. To transfer data between two collections:
• Click Transfer or Sync to... in the command panel on the right side of the page. A new collection panel will open, with a Transfer or Sync to field at the top of the panel. Alternatively, use the Panels toggle button in the top right corner to split the directory tree panel and select a destination collection.
• Find and choose the second collection and connect to it as you did with the first one. In the left (first) collection, select all the files to transfer. The Start button at the bottom of the panel will become active. Between the two Start buttons at the bottom of the page, the Transfer and Sync Options tab provides access to several options. By default, Globus verifies file integrity after transfer using checksums. Change the transfer settings if you would like. You may also enter a label for the transfer.
• Click the Start button to transfer the selected files to the collection in the right panel. Globus will display a green notification panel - confirming that the transfer request was submitted - and add a badge to the Activity item in the command menu on the left of the page. You can navigate away from the File Manager, close the browser window, and even log out. Globus will optimize the transfer for performance, monitor the transfer for completion and correctness, and recover from network errors and collection downtime.


Completed file transfers can be seen in the Activity tab in the command menu on the left of the page. On the Activity page, click the arrow icon on the right to view details about the transfer. You will also receive an email with the transfer details.

7.3.4 Management Console

File transfers can also be monitored via the Globus Management Console.

7.4 Local endpoints

Users can create and manage endpoints on their personal computers via Globus Connect Personal. Globus Connect Personal is available for all major operating systems. Download the software and follow the installation instructions. During the setup process you will be asked to set the local endpoint. Each endpoint has a unique ID. Those ones created or managed by you are associated with your Globus account. To access and manage your local endpoints: • Start the Globus Connect Personal applet. • From a browser log into Globus and in the File Manager navigate to Endpoints. Then select the Administered by you tab. • To change the attributes of an endpoint click on its name. You can adjust a variety of settings like private or public, endpoint name, contact info, and encryption.


Warning: Users should use caution and carefully select the privacy permissions and attributes when creating endpoints on their personal computers.

7.5 Data sharing

IMPORTANT! Currently, data can be shared with others only if the Globus endpoint is one of the user’s own managed endpoints. To share data, you will create a guest collection and grant your collaborators access as described in the instructions below. If you like, you can designate other Globus users as access managers for the guest collection, allowing them to grant or revoke access privileges for other Globus users.

• Login to Globus and navigate to File Manager. • Select the collection that has the files/folders you wish to share and, if necessary, activate the collection.


• Highlight the folder that you would like to share and click Share in the right command panel.
• If Share is not available, contact the endpoint's administrator or refer to the Globus Connect Server Installation Guide for instructions on enabling sharing. If you are using a Globus Connect Personal endpoint and you are a Globus Plus user, enable sharing by opening the Preferences for Globus Connect Personal, clicking the Access tab, and checking the Sharable box.
• Provide a name for the guest collection, and click Create Share.
• When your collection is created, you will be taken to the Sharing tab, where you can set permissions. As shown below, the starting permissions give read and write access (and the Administrator role) to the person who created the collection. Click the Add Permissions button or icon to share access with others. You can add permissions for an individual user, for a group, or for all logged-in users. In the Identity/E-mail field, type a person's name or username (if user is selected) or a group name (if group is selected) and press Enter. Globus will display matching identities; pick from the list. If the user has not used Globus before or you only have an email address, enter the email address and click Add. The users you share with will receive an email notification containing a link to the shared endpoint. You may add a customized message to this email. If you do not want to send a notification, uncheck the Send E-mail checkbox. You can add permissions to subfolders by entering a path in the Path field.
• After receiving the email notification, your colleague can click on the link to log into Globus and access the guest collection.
• You can allow others to manage the permissions for a collection you create. Use the Roles tab to manage roles for other users. You can assign roles to individual users or to groups. As shown below, the default is for the person who created the collection to have the Administrator role. The Access Manager role grants the ability to manage permissions for a collection (users with this role automatically have read/write access for the collection). When a role is assigned to a group, all members of the group have the assigned role.

Warning: Be careful whom you give permissions to when sharing data.

7.6 Manage Globus groups

Globus allows you to create and manage groups of Globus users and share files and folders with these groups. More information about Globus groups can be found on the Globus Groups How-To page.


• Click Groups in the left-side command pane to open the Your Groups page. You will see a list of all the groups you are a member of, including those you administer or manage. To search for a group you belong to, type part of its name in the Filter groups field above the list.
• Click Create new group to create a group.
• Type a name for the group. You can also enter a description that tells prospective members about the group. Then click Create Group.
• After creating a group, Globus will return you to the Your Groups page. Click the right arrow next to the newly created group.
• Click Invite Others to invite other users to the group. You can invite others to the group by entering email addresses or by searching for Globus identities. Globus will send each person an email that includes a link for accepting the invitation to the group.
  – Invite by email address: enter the email address for the person you wish to invite and click Add. This is a good option to use for members who do not yet have a Globus account.
  – Invite by Globus identity: enter all or part of the person's name or email address and press Enter to search for a current Globus identity. Select the user you wish to invite from the search results.
• Click the Members tab to view users who have been invited or who are already members of the group. The Status field shows the membership status of each user, and the Role field shows each user's role in the group. The Administrator in the group can change any user's role or status in the group, including removing a user, by clicking the right arrow next to the user's name.
• Click the Settings tab to view the group's settings and policies. These policies control who can see the group and its membership list, how new users are added to the group, and related privileges. Click Edit Policies to change any of these policies.


7.7 Command Line Interface (CLI)

In addition to the web interface, a set of Globus CLI tools is also available. These are useful for automating your transfer processes, including the scheduling of recurrent transfers.

7.7.1 Getting started with the CLI

The Globus CLI documentation contains an adequate introduction to the CLI. The following only includes VSC-specific examples and tips. The latest release can be locally installed in, e.g., a Python virtual environment as follows:

$ /usr/bin/python3 -m venv venv_globus
$ source ./venv_globus/bin/activate
$ pip install globus-cli

One of the dependencies is currently incompatible with Python3’s ASCII encoding, requiring a change in the locale settings:

$ export LC_ALL=en_US.utf-8 && export LANG=en_US.utf-8

You can now authenticate by executing:

$ globus login

and logging in via the generated URL, which will provide an access token. The available VSC endpoints can then be listed as follows:

$ globus endpoint search "VSC Tier"
ID                                   | Owner               | Display Name
------------------------------------ | ------------------- | ---------------------
4f9698ae-5644-11eb-a45c-0e095b4c2e55 | [email protected] | VSC KU Leuven Tier1
2e1e56a4-3faa-11eb-b185-0ee0d5d9299f | [email protected] | VSC KU Leuven Tier2
ff4d98be-5c46-11e9-a623-0a54e005f950 | [email protected] | VSC UAntwerpen Tier2
bc2900d0-516c-11e9-bf30-0edbf3a4e7ee | [email protected] | VSC UGent Tier2
c62f6838-88f5-11e9-b807-0a37f382de32 | [email protected] | VSC VUB Tier2

To transfer a file from e.g. the KU Leuven Tier2 scratch to the Tier1 scratch:

$ endpoint_src=2e1e56a4-3faa-11eb-b185-0ee0d5d9299f
$ endpoint_dest=4f9698ae-5644-11eb-a45c-0e095b4c2e55
$ globus transfer $endpoint_src:$VSC_SCRATCH/testfile.tgz \
    $endpoint_dest:$VSC_SCRATCH/testfile.tgz --label "test_transfer"
Message: The transfer has been accepted and a task has been created and queued for execution
Task ID: 335cffea-9ef7-42dc-aac2-11cc312c3e51

Many more examples can be found in the Globus CLI documentation.


7.7.2 Scheduling transfers with Globus-Timer-CLI

Transfers can also be scheduled to happen in the future (instead of being executed immediately as in the example above) and to be repeated on a regular basis. These services are enabled by the Globus Timer API. The corresponding CLI tools can be installed in your Python (virtual) environment using Pip (Globus-Timer-CLI on PyPi). The above example can then be made to start at a certain point in the future and to be repeated every 7 days:

$ globus-timer job transfer \
    --name test_transfer_weekly \
    --label "test_transfer_weekly" \
    --interval 604800 \
    --start '2021-06-02T11:00:00' \
    --source-endpoint $endpoint_src \
    --dest-endpoint $endpoint_dest \
    --item $VSC_SCRATCH/testfile.tgz $VSC_SCRATCH/testfile.tgz false
Name: test_transfer
Job ID: 6aa3ca88-8ae4-442d-99da-c5a89821c299
Status: new
Start: 2021-06-02T09:00:00+00:00
Interval: 7 days, 0:00:00
Next Run At: 2021-06-09T09:00:00+00:00
Last Run Result: NOT RUN

For more information on the available transfer options and monitoring tools, please consult Globus-Timer-CLI on PyPi.

Warning: Times in the globus-timer outputs are always in UTC time, not in our local (CEST) time.

7.8 Glossary

Globus Collection  A Globus Collection is a named location containing data you can access with Globus. Collections can be hosted on many different kinds of systems, including campus storage, HPC clusters, laptops, Amazon S3 buckets, Google Drive, and scientific instruments. When you use Globus, you don't need to know a physical location or details about storage. You only need a collection name. A collection allows authorized Globus users to browse and transfer files. Globus Collections can also be used for sharing data with others, for data publication, and for enabling discovery by other Globus users.
Globus Connect  Globus Connect is used to host collections.
Endpoint  An endpoint is a server that hosts Globus Collections. If you want to be able to access, share, transfer, or manage data using Globus, the first step is to create an endpoint on the system where the data is (or will be) stored. An endpoint can be a laptop, a personal desktop system, a laboratory server, a campus data storage service, a cloud service, or an HPC cluster.
Globus  Globus is a data sharing and data transfer platform.

See also: The official Globus How-To pages.


CHAPTER 8

Frequently asked questions (FAQs)

8.1 General questions

8.1.1 User support

Each of the major VSC institutions has its own user support:
• KU Leuven/Hasselt University: hpcinfo@kuleuven.be
• Ghent University: hpc@ugent.be; for further info, see the web site
• University of Antwerp: hpc@uantwerpen.be; for further info, see the CalcUA Core Facility web page
• Vrije Universiteit Brussel: hpc@vub.be; see also our website for VUB-HPC specific info

What information should I provide when contacting user support?

When you submit a support request, it helps if you always provide:
1. your VSC user ID (or VUB netID),
2. contact information: it helps to specify your preferred mail address and phone number for contact,
3. an informative subject line for your request,
4. the time the problem occurred,
5. the steps you took to resolve the problem.

Warning: If you are working in a terminal, please do not provide a screenshot, simply copy/paste the text in your terminal. This is easier to read for support staff, and allows them to copy/paste information, rather than having to retype, e.g., paths.


Below, you will find more useful information you can provide for various categories of problems you may encounter. Although it may seem like more work to you, it will often save a few iterations and get your problem solved faster.

If you have problems logging in to the system

Please provide the following information:
1. your operating system (e.g., Linux, Windows, macOS, ...),
2. your client software (e.g., PuTTY, OpenSSH, ...),
3. your location (e.g., on campus, at home, abroad),
4. whether the problem is systematic (how many times did you try, over which period) or intermittent,
5. any error messages shown by the client software, or an error log if it is available.
Logging information can easily be obtained from the OpenSSH client, PuTTY and MobaXterm.

If installed software malfunctions/crashes

Please provide the following information:
1. the name of the cluster you are working on,
2. the name of the application (e.g., Ansys, Matlab, R, ...),
3. the module(s) you load to use the software (e.g., R/3.1.2-intel-2015a),
4. the error message the application produces, as well as the location of the output and error files if this ran as a job,
5. whether the error is reproducible,
6. if possible, a procedure and data to reproduce the problem,
7. if the application was run as a job, the jobID(s) of (un)successful runs.

If your own software malfunctions/crashes

Please provide the following information:
1. the location of the source code,
2. the error message produced at build time or runtime, as well as the location of the output and error files if this ran as a job,
3. the toolchain and other module(s) you load to build the software (e.g., intel/2015a with HDF5/1.8.4-intel-2015a),
4. if possible and applicable, a procedure and data to reproduce the problem,
5. if the software was run as a job, the jobID(s) of (un)successful runs.

8.1.2 How do I acknowledge the VSC in publications?

When using the VSC-infrastructure for your research, you must acknowledge the VSC in all relevant publications. This will help the VSC secure funding, and hence you will benefit from it in the long run as well. It is also a contractual obligation for the VSC. • Please use the following phrase to do so in Dutch:


De infrastructuur en dienstverlening gebruikt in dit werk werd voorzien door het VSC (Vlaams Supercomputer Centrum), gefinancierd door het FWO en de Vlaamse overheid.
• or in English:
The resources and services used in this work were provided by the VSC (Flemish Supercomputer Center), funded by the Research Foundation - Flanders (FWO) and the Flemish Government.

Moreover, if you are in the KU Leuven association, you are also requested to add the relevant papers to the virtual collection "High Performance Computing" in Lirias so that we can easily generate the publication lists with relevant publications.

8.1.3 What are standard terms used in HPC?

HPC cluster  A relatively tightly coupled collection of compute nodes; the interconnect typically allows for high bandwidth, low latency communication. Access to the cluster is provided through a login node. A resource manager and scheduler provide the logic to schedule jobs efficiently on the cluster. A detailed description of the VSC clusters and other hardware is available.
Compute node  An individual computer, part of an HPC cluster. Currently most compute nodes have two sockets, each with a single CPU, volatile working memory (RAM), a hard drive (typically small, and only used to store temporary files) and a network card. The hardware specifications for the various VSC compute nodes are available.
CPU  Central Processing Unit, the chip that performs the actual computation in a compute node. A modern CPU is composed of numerous cores, typically 10 to 36. It also has several cache levels that help in data reuse.
Core  Part of a modern CPU, a core is capable of running processes, and has its own processing logic and floating point unit. Each core has its own level 1 and level 2 cache for data and instructions. Cores share the last level cache.
Cache  A relatively small amount of (very) fast volatile memory (when compared to regular RAM), on the CPU chip. A modern CPU has three cache levels: L1 and L2 are specific to each core, while L3 (also referred to as Last Level Cache, LLC) is shared among all the cores of a CPU.
RAM  Random Access Memory used as working memory for the CPUs. On current hardware, the size of RAM is expressed in gigabytes (GB). The RAM is shared between the two CPUs on each of the sockets. This is volatile memory in the sense that once the process that creates the data ends, the data in the RAM is no longer available. The complete RAM can be accessed by each core.
Walltime  The actual time an application runs (as in clock on the wall), or is expected to run. When submitting a job, the walltime refers to the maximum time the application can run, i.e., the requested walltime. For accounting purposes, the walltime is the time the application actually ran, typically less than the requested walltime.
Node-hour  Unit of work indicating that an application ran for a time t on n nodes, such that n*t = 1 hour. Using 1 node for 1 hour is 1 node-hour. This is irrespective of the number of cores on the node you actually use.
Node-day  Unit of work indicating that an application ran for a time t on n nodes such that n*t = 24 hours. Using 3 nodes for 8 hours results in 1 node-day.
Core-hour  Unit of work indicating that an application ran for a time t on p cores, such that p*t = 1 hour. Using 20 cores, no matter on how many nodes, for 1 hour results in 20 core-hours.
Memory requirement  The amount of RAM required to successfully run an application. It can be specified per process for a distributed application, expressed in GB.
Storage requirement  The amount of disk space required to store the input and output of an application, expressed in GB or TB.
Temporary storage requirement  The amount of disk space needed to store temporary files during the run of an application, expressed in GB or TB.


Single user per node policy  Indicates that when a process of user A runs on a compute node, no process of another user will run on that compute node concurrently, i.e., the compute node will be exclusive to user A. However, if one or more processes of user A are running on a compute node, and that node's capacity in terms of available cores and memory is not exceeded, processes that are part of another job submitted by user A may start on that compute node.
Single job per node policy  Indicates that when a process of a job is running on a compute node, no other job will concurrently run on that node, regardless of the resources that still remain available.
Serial application  A program that runs a single process, with a single thread. All computations are done sequentially, i.e., one after the other; no explicit parallelism is used.
Shared memory application  An application that uses multiple threads for its computations, ideally concurrently executed on multiple cores, one per thread. Each thread has access to the application's global memory space (hence the name), and has some thread-private memory. A shared memory application runs on a single compute node. Such an application is also referred to as a multi-core or a multi-threaded application.
Threads  A process can concurrently perform multiple computations, i.e., program flows. In scientific applications, threads typically process their own subset of data, or a subset of loop iterations.
OpenMP  A standard for shared memory programming using C/C++/Fortran that makes abstraction of explicit threads. OpenMP is widely used for scientific programming.
Distributed application  An application that uses multiple processes. The application's processes can run on multiple compute nodes. These processes communicate by exchanging messages, typically implemented by calls to an MPI library. Messages can be used to exchange data and coordinate the execution.
Process  An independent computation running on a computer. It may interact with other processes, and it may run multiple threads. A serial and a shared memory application run as a single process, while a distributed application consists of multiple, coordinated processes.
MPI  Message Passing Interface, a de-facto standard that defines functions for inter-process communication. Many implementations in the form of libraries exist for C/C++/Fortran, some vendor specific.
GPU  A Graphical Processing Unit is a hardware component specifically designed to perform graphics related tasks efficiently. GPUs have been pressed into service for scientific computing. A compute node can be equipped with multiple GPUs. Software has to be designed specifically to use GPUs, and for scientific computing, CUDA and OpenACC are the most popular programming paradigms.
GPGPU  General Purpose computing on Graphical Processing Units refers to using graphics accelerators for non-graphics related tasks such as scientific computing.
CUDA  Compute Unified Device Architecture, an extension to the C programming language to develop software that can use GPUs for computations. CUDA applications run exclusively on NVIDIA hardware.
OpenACC  Open ACCelerators is a standard for developing C/C++/Fortran applications that can use GPUs for general purpose computing. OpenACC is mainly targeted at scientific computing.

8.2 Access to the infrastructure

8.2.1 Messed up keys

You can fix this yourself in a few easy steps via the VSC account page. There are two ways in which you may have messed up your keys:

1. The keys that were stored in the .ssh subdirectory of your home directory on the cluster, or the authorized_keys file, were accidentally deleted. In this case:


   1. Go to the VSC account page.
   2. Choose your institute and log in.
   3. At the top of the page, click 'Edit Account'.
   4. Press the 'Update' button on that web page.
   5. Exercise some patience; within 30 minutes, your account should be accessible again.

2. You deleted your (private) keys on your own computer, or you don't remember the passphrase. In this case:
   1. Generate a new public/private key pair. Follow the procedure outlined in the client sections for Linux, Windows and macOS (formerly OS X).
   2. Go to the VSC account page.
   3. Choose your institute and log in.
   4. At the top of the page, click 'Edit Account'.
   5. Upload your new public key by adding it in the 'Add Public Key' section of the page. Use 'Browse. . . ' to find your public key, then press 'Add' to upload it.
   6. You may now delete the entry for the "lost" key if you know which one that is, but this is not crucial.
   7. Exercise some patience; within 30 minutes, your account should be accessible again.
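For scenario 2, a minimal sketch of generating a fresh key pair on Linux or macOS is shown below; the key type, size and file name are only examples, see the client sections for the full, platform-specific instructions.

$ ssh-keygen -t rsa -b 4096 -f ~/.ssh/id_rsa_vsc   # choose a strong passphrase when prompted
$ cat ~/.ssh/id_rsa_vsc.pub                        # this is the public key to upload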

8.3 Running jobs

8.3.1 How can I run many similar computations conveniently?

It is often necessary to run the same application on many input files, or with many different parameter values. You can of course manage such jobs by hand, or write scripts to do that for you. However, this is a very common scenario, so we developed software to do it for you. Two general purpose software packages are available:
• the worker framework (worker quick start and worker documentation), and
• atools (atools documentation).
Both are designed to handle this use case, but each has its own strengths and weaknesses.

What features do atools and worker have?

Both software packages will do the bookkeeping for you, i.e.,
• keep track of the computations that were completed;
• monitor the progress of a running job;
• allow you to resume computations in case you underestimated the walltime requirements;
• provide an overview of the computations that succeeded, failed, or were not completed;
• aggregate output of individual computations;
• analyze the efficiency of a finished job.


Both software packages have been designed with simplicity in mind; one of the design goals is to make them as easy to use as possible. For a detailed overview of the features, see the atools documentation and the worker documentation.

What to use: atools or worker?

That depends on a number of issues, and you have to consider them all to make the correct choice.
• Is an individual computation sequential, multi-threaded or even distributed?
• How much walltime is required by an individual computation?
• How many individual computations do you need to perform?
• What are the job policies of the cluster you want to run on?

Type of computation

In worker and atools terminology, an individual computation is referred to as a work item. Depending on the implementation of the work item, worker or atools may be a better match. The following table summarizes this.

work item type     worker    atools
sequential         yes       yes
multi-threaded     yes       yes
MPI-based          no        yes

Warning: Although this might seem to suggest that atools is the best choice since it can deal with all types of work items, that is definitely not true.

The table makes it clear that MPI applications cannot be used in work items for worker: worker itself is implemented using MPI, and hence things would get terminally confused if it executed work items that contain calls to the MPI API.

Walltime per work item

When work items take only a short time to complete, the overhead of starting new work items will be considerable for atools, since it relies on the scheduler to start individual work items. This is much more efficient for worker, since all work items are executed by a single job and the scheduler is not involved. On the other side of the spectrum, i.e., work items that take a very long time to complete, atools may be the better choice since work items are executed independently. This however depends on the reliability of the infrastructure. The following table summarizes this.

single work item walltime    worker    atools
< 1 second                   -         --
< 1 minute                   +         -
1 minute to 24 hours         ++        ++
> 24 hours                   +         ++


Number of work items

If you need to do many individual computations (work items), say more than 500, worker is the better choice. It will be run as a single job, rather than many individual jobs, hence lightening the load on the scheduler considerably.

Job policies

The following job policies are currently in effect on various VSC clusters.

Shared Multiple jobs from multiple users are allowed to run concurrently on a node.

Single user Multiple jobs from a single user are allowed to run concurrently on a node.

Single job Only a single job can run on a compute node.

On some clusters, credits are required to run jobs, and that policy may also influence your choice. The table below provides an overview of the policies in effect on the various clusters/partitions.

VSC hub    cluster            partition    policy        accounting
Antwerp    any                any          single user   no
Brussels   any                any          shared        no
Ghent      any                any          shared        no
Leuven     genius             default      shared        yes
Leuven     genius             bigmem       shared        yes
Leuven     genius             gpu          shared        yes
Leuven     genius             superdome    shared        yes
Leuven     breniac (Tier-1)   default      single job    yes

Clusters with accounting enabled: if you use atools on a cluster where accounting is active, make sure a work item uses all resources of that node. If multiple work items run on the same node concurrently, you will be charged for each work item individually, making that a very expensive computation. In this situation, use worker.

Clusters with single user policy: ensure that the load balance is as good as possible. If a few work items require much more time than others, they may block the nodes they are running on from running other jobs. This is the case for both atools and worker. However, since worker is an MPI application, it will keep all nodes involved in the job blocked, aggravating the problem.

Clusters with shared policy: here atools allows the scheduler the most flexibility, but keep in mind the considerations on work item walltime and the number of work items.
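To give an idea of what this looks like in practice, the sketch below follows the worker framework documentation; the module version, job script name and CSV file name are illustrative only (atools has its own, different interface, see its documentation).

$ module load worker                      # the exact module name/version depends on the cluster
$ wsub -batch run.pbs -data parameters.csv

Here run.pbs is a job script that refers to the columns of parameters.csv as shell variables; each line of the CSV file then becomes one work item.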

8.3.2 Can I run workflows on an HPC cluster?

Short answer: yes, you can. Some workflows do not lend themselves to being completed in a single job. As a running example, consider the following simple workflow:

1. Preprocess the first data file: the data in file input1.dat has to be preprocessed using an application preprocess that runs only on a single core and takes 1 hour to complete. The processed data is written to a file preprocessed1.dat.

2. Main computation on the first preprocessed file: the preprocessed data in preprocessed1.dat is now used as input to run a large scale simulation using the application simulate that can run on 20 nodes and requires 3 hours to complete. It will produce an output file simulated1.dat.


3. Preprocess the second data file: the data in file input2.dat has to be preprocessed using the application preprocess, again on a single core and taking 1 hour to complete. The processed data is written to a file preprocessed2.dat.

4. Main computation on the second preprocessed file: the preprocessed data in preprocessed2.dat is now used as input to run a large scale simulation using the application simulate, again on 20 nodes and requiring 3 hours to complete. It will produce an output file simulated2.dat.

5. Postprocess the data: the files simulated1.dat and simulated2.dat are not suitable for analysis; they have to be postprocessed using the application postprocess that can run on all cores of a single node and takes 1 hour to complete. It will produce result.dat as final output.

This workflow can be executed using the job script below:

#!/usr/bin/env bash
#PBS -l nodes=20:ppn=36
#PBS -l walltime=10:00:00

cd $PBS_O_WORKDIR
preprocess --input input1.dat --output preprocessed1.dat
simulate --input preprocessed1.dat --output simulated1.dat
preprocess --input input2.dat --output preprocessed2.dat
simulate --input preprocessed2.dat --output simulated2.dat
postprocess --input simulated1.dat simulated2.dat --output result.dat

Just to be on the safe side, 10 hours of walltime were requested rather than the required 9 hours total. We assume that compute nodes have 36 cores. The problem obviously is that during one third of the execution time, 19 out of the 20 requested nodes are idling: the preprocessing and postprocessing steps run on a single node. This wastes 57 node-hours out of a total of 180 node-hours, so your job has an efficiency of at most 68 %, which is unnecessarily low. Rather than submitting this as a single job, it would be much more efficient to submit it as five separate jobs: two for preprocessing, two for the simulations, and a fifth for postprocessing.

1. Preprocessing the first input file would be done by preprocessing1.pbs:

#!/usr/bin/env bash
#PBS -l nodes=1:ppn=1
#PBS -l walltime=1:30:00

cd $PBS_O_WORKDIR
preprocess --input input1.dat --output preprocessed1.dat


2. Preprocessing the second input file would be done by preprocessing2.pbs, similar to preprocessing1.pbs.

3. The simulation on the first preprocessed file would be done by simulation1.pbs:

#!/usr/bin/env bash
#PBS -l nodes=20:ppn=36
#PBS -l walltime=6:00:00

cd $PBS_O_WORKDIR
simulate --input preprocessed1.dat --output simulated1.dat

4. The simulation on the second preprocessed file would be done by simulation2.pbs, which would be similar to simulation1.pbs.

5. The postprocessing would be done by postprocessing.pbs:

#!/usr/bin/env bash
#PBS -l nodes=1:ppn=36
#PBS -l walltime=6:00:00

cd $PBS_O_WORKDIR
postprocess --input simulated1.dat simulated2.dat --output result.dat

However, if we were to submit these five jobs independently, the scheduler may start them in any order, so the postprocessing job might run first and immediately fail because the files simulated1.dat and/or simulated2.dat don't exist yet.

Note: You don’t necessarily have to create separate job scripts for similar tasks since it is possible to parameterize job scripts.

Job dependencies

The scheduler supports specifying job dependencies, e.g.,
• a job can only start when two other jobs completed successfully, or
• a job can only start when another job did not complete successfully.

Job dependencies can effectively solve our problem since
• simulation1.pbs should only start when preprocessing1.pbs finishes successfully,
• simulation2.pbs should only start when preprocessing2.pbs finishes successfully, and
• postprocessing.pbs should only start when both simulation1.pbs and simulation2.pbs finished successfully.

It is easy to enforce this using job dependencies. Consider the following sequence of job submissions:

$ preprocessing1_id=$(qsub preprocessing1.pbs)
$ preprocessing2_id=$(qsub preprocessing2.pbs)
$ simulation1_id=$(qsub -W depend=afterok:$preprocessing1_id simulation1.pbs)
$ simulation2_id=$(qsub -W depend=afterok:$preprocessing2_id simulation2.pbs)
$ qsub -W depend=afterok:$simulation1_id:$simulation2_id postprocessing.pbs


The qsub command returns the job ID, which is assigned to a bash variable and used in subsequent submissions to specify the job dependencies using -W depend. In this case, follow-up jobs should only run when the previous jobs succeeded, hence the afterok dependencies. The scheduler can run preprocessing1.pbs and preprocessing2.pbs concurrently if the resources are available (and can even do so on the same node). Once either is done, it can start the corresponding simulation, again potentially concurrently if 40 nodes happen to be free. When both simulations are done, the postprocessing can start. Since each step requests only the resources it really requires, efficiency is optimal, and the total time could be as low as 5 hours rather than 9 hours if ample resources are available.

Types of dependencies

The following types of dependencies can be specified:

afterok only start the job when the jobs with the specified job IDs have all completed successfully.

afternotok only start the job when the jobs with the specified job IDs have all completed unsuccessfully.

afterany only start the job when the jobs with the specified job IDs have all completed, regardless of success or failure.

after start the job as soon as the jobs with the specified job IDs have all started to run.

A similar set of dependencies is defined for job arrays, e.g., afterokarray:[] indicates that the submitted job can only start after all jobs in the job array have completed successfully. The dependency types listed above are the most useful ones; for a complete list, see the official qsub documentation. Unfortunately, not everything works as advertised. To conveniently and efficiently execute embarrassingly parallel parts of a workflow (e.g., parameter exploration, or processing many independent inputs), the worker framework or atools will be helpful.
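As a small, hedged illustration of afternotok (cleanup.pbs is a hypothetical script, not part of the running example), you could submit a job that only runs when the first simulation fails, e.g., to remove partial output:

$ simulation1_id=$(qsub -W depend=afterok:$preprocessing1_id simulation1.pbs)
$ qsub -W depend=afternotok:$simulation1_id cleanup.pbs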

Job success or failure

The scheduler determines success or failure of a job by its exit status:
• if the exit status is 0, the job is successful;
• if the exit status is not 0, the job failed.

The exit status of the job is strictly negative when the job failed because, e.g.,
• it ran out of walltime and was aborted, or
• it used too much memory and was killed.

If the job finishes normally, the exit status is determined by the exit status of the job script. The exit status of the job script is either
• the exit status of the last command that was executed, or
• an explicit value in a bash exit statement.

When you rely on the exit status for your workflow, you have to make sure that the exit status of your job script is correct, i.e., if anything went wrong, it should be strictly positive (between 1 and 127 inclusive).

Note: This illustrates why it is bad practice to have exit 0 as the last statement in your job script.

In our running example, the exit status of each job would be that of the last command executed, so that of preprocess, simulate and postprocess respectively.
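If you prefer not to rely on the exit status of the last command, a job script can set its exit status explicitly. The sketch below is hedged: it simply reuses the hypothetical simulate command of the running example.

#!/usr/bin/env bash
#PBS -l nodes=20:ppn=36
#PBS -l walltime=6:00:00

cd $PBS_O_WORKDIR
simulate --input preprocessed1.dat --output simulated1.dat
status=$?
if [ $status -ne 0 ]; then
    # signal failure so that jobs depending on this one via afterok do not start
    echo "simulate failed with exit status $status" >&2
    exit 1
fi
# reaching this point means success, so the script exits with status 0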

Parameterized job scripts

Consider the two job scripts for preprocessing the data in our running example. The first one, preprocessing1.pbs is:

#!/usr/bin/env bash
#PBS -l nodes=1:ppn=1
#PBS -l walltime=1:30:00

cd $PBS_O_WORKDIR
preprocess --input input1.dat --output preprocessed1.dat

The second one, preprocessing2.pbs is nearly identical:

#!/usr/bin/env bash
#PBS -l nodes=1:ppn=1
#PBS -l walltime=1:30:00

cd $PBS_O_WORKDIR
preprocess --input input2.dat --output preprocessed2.dat

Since it is possible to pass variables to job scripts when using qsub, we could create a single job script preprocessing.pbs using two variables in_file and out_file:

#!/usr/bin/env bash
#PBS -l nodes=1:ppn=1
#PBS -l walltime=1:30:00

cd $PBS_O_WORKDIR
preprocess --input "$in_file" --output "$out_file"

The job submission to preprocess input1.dat and input2.dat would be:

$ qsub -v in_file=input1.dat,out_file=preprocessed1.dat preprocessing.pbs
$ qsub -v in_file=input2.dat,out_file=preprocessed2.dat preprocessing.pbs

Using job dependencies and variables in job scripts allows you to define quite sophisticated workflows, simply relying on the scheduler.
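To make this concrete, the submission loop below combines both mechanisms. It is a hedged sketch that assumes a parameterized simulation.pbs analogous to preprocessing.pbs, which is not shown above.

#!/usr/bin/env bash
# hypothetical helper script that submits the whole workflow for both data sets
for i in 1 2; do
    # submit the parameterized preprocessing job and capture its job ID
    prep_id=$(qsub -v in_file=input${i}.dat,out_file=preprocessed${i}.dat preprocessing.pbs)
    # the simulation may only start once its preprocessing job completed successfully
    qsub -W depend=afterok:${prep_id} \
         -v in_file=preprocessed${i}.dat,out_file=simulated${i}.dat simulation.pbs
done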

8.4 Software

8.4.1 How do I run applications in parallel?

That depends.


What is parallelism anyway?

Parallelism in the context of HPC can be defined at the level of
1. instructions (vectorization),
2. threads (shared memory),
3. processes (distributed memory),
4. hybrid (shared + distributed memory).

Instruction-level parallelism is essentially SIMD (Single Instruction, Multiple Data) at the level of a single core in a CPU. It uses floating point or integer vector registers in the core in order to perform the same operations on multiple values in the same clock cycle. Another term for this is vectorization.

Thread-level parallelism in the context of scientific computing means that the application runs multiple threads, typically each on its own dedicated CPU core to exploit modern CPU architecture. All threads run on the same compute node, and can interact through shared memory.

Process-level parallelism implies that an application consists of multiple processes, potentially running on many compute nodes, and communicating over the network.

The term hybrid parallelism refers to the combination of thread-level and process-level parallelism. For instance, a process can run on each compute node, and each of these processes consists of multiple threads sharing data on that node.

Note: For most parallel applications, thread-level or process-level parallelism is combined with instruction-level parallelism.

How can I vectorize my code?

Your application will typically be parallelized at the instruction level, since this is mostly done automatically by the compiler used to build your application, provided the right compiler options are specified.
• For installed software, your friendly user support person will typically have taken care of that.
• If you build your own software, some relevant information can be found below.

CPU instruction sets support vector operations, i.e., floating point operations such as additions and multiplications can be performed on multiple floating point numbers simultaneously. The various CPU architectures have added extensions to the original instruction sets.

CPU type                   vector instruction set
Ivy Bridge/Sandy Bridge    AVX
Haswell/Broadwell          AVX2
Skylake/Cascade Lake       AVX-512

Software that is specifically compiled to run on, e.g., Ivy Bridge will run on a CPU of a more recent generation such as Skylake, but not with optimal performance. However, if software is built specifically to use, e.g., the AVX-512 instruction set, it will not run on older hardware such as Haswell CPUs. To build for a specific architecture, both the Intel and GCC compiler families offer command line options. See the toolchain documentation for the Intel and FOSS toolchains for an overview of the relevant compiler options.
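As a hedged sketch (my_app.c is a placeholder; the toolchain documentation lists the recommended flags), targeting a specific instruction set could look like this:

$ gcc -O3 -march=native -o my_app my_app.c       # optimize for the CPU you compile on
$ gcc -O3 -mavx2 -o my_app my_app.c              # explicitly target AVX2 (Haswell/Broadwell)
$ icc -O3 -xCORE-AVX512 -o my_app my_app.c       # Intel compiler, target AVX-512 (Skylake/Cascade Lake)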


How can I run my application with multiple threads?

This is only possible when the application has been specifically designed to do so.
• For installed software, check the manual. It will be documented whether the application/library is multi-threaded, and how to use it.
• If you build your own software, there is some information below.

Note: Typically, a multi-threaded application runs on a single compute node. The threads communicate by exchanging information in memory, hence the term shared memory computing.

There are a few commonly used approaches to create a multi-threaded application:

OpenMP This is by far the most popular approach for scientific software. Many compiler suites (e.g., Intel and GCC) support it. OpenMP defines directives that can be used in C, C++ and Fortran, as well as a runtime library. Instructions are available for compiling and running OpenMP applications with the foss and Intel toolchains.

Threading Building Blocks (TBB) Originally developed by Intel, this open source library offers many primitives for shared memory and data driven programming in C++.

POSIX threads (pthreads) Although it is possible to use a low-level threading library such as pthreads, this is typically not the way to go for scientific programming.
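As a hedged sketch (omp_app.c stands in for your own OpenMP source; see the toolchain documentation for the recommended modules and flags), compiling and running an OpenMP application typically looks like:

$ gcc -O2 -fopenmp -o omp_app omp_app.c
$ export OMP_NUM_THREADS=36      # match the number of cores requested for the job
$ ./omp_app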

Can I run my application on multiple compute nodes?

This is only possible when the application has been specifically designed to do so, or when your use case matches some common pattern.
• For installed software, check the manual. It will be documented whether the application/library can be run distributed, and how to do that.
• If you run an application many times for different parameter settings, or on different data sets, check out the worker framework documentation or the atools documentation; for a comparison, see the question "What to use: atools or worker?" above.
• If you build your own software, there is some information below.

For scientific software, the go-to library for distributed programming is an implementation of MPI (Message Passing Interface). This is a de-facto standard implemented by many libraries, and the API can be used from C/C++ and Fortran. On the clusters, at least two implementations are available: Intel MPI in the Intel toolchain, and Open MPI in the FOSS toolchain.
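As a hedged sketch (mpi_app.c is a placeholder; compiler wrappers and launcher options differ between the Intel and FOSS toolchains), building and running an MPI application could look like:

$ mpicc -O2 -o mpi_app mpi_app.c
$ mpirun -np 40 ./mpi_app        # e.g., 40 processes spread over the nodes of the job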

8.4.2 Can I run containers on the HPC systems?

The best-known container implementation is doubtlessly Docker. However, due to security concerns HPC sites typi- cally don’t allow users to run Docker containers. Fortunately, Singularity addresses container related security issues, so Singularity images can be used on the cluster. Since a Singularity image can be built from a Docker container, that should not be a severe limitation.


When should I use containers?

If the software you intend to use is available on the VSC infrastructure, don't use containers. That software has been built to use specific hardware optimizations, while software deployed via containers is typically built for the lowest common denominator.

Good use cases include:
• Containers can be useful to run software that is hard to install on HPC systems, e.g., GUI applications, legacy software, and so on.
• Containers can be useful to deal with compatibility issues between Linux flavors.
• You want to create a workflow that can run on VSC infrastructure, but can also burst to a third-party compute cloud (e.g., AWS or Microsoft Azure) when required.
• You want to maximize the period during which your software can be run in a reproducible way.

How can I create a Singularity image?

You have three options to build images: locally on your own machine, in the cloud, or on the VSC infrastructure.

Building on VSC infrastructure

Given that most build procedures require superuser privileges, your options on the VSC infrastructure are limited. You can build an image from a Docker container, e.g., to build an image that contains a version of TensorFlow and has Jupyter as well, use:

$ export SINGULARITY_TMPDIR=$VSC_SCRATCH/singularity_tmp
$ mkdir -p $SINGULARITY_TMPDIR
$ export SINGULARITY_CACHEDIR=$VSC_SCRATCH/singularity_cache
$ mkdir -p $SINGULARITY_CACHEDIR
$ singularity build tensorflow.sif docker://tensorflow/tensorflow:latest-jupyter

Warning: Don’t forget to define and create the $SINGULARITY_TMPDIR and $SINGULARITY_CACHEDIR since if you fail to do so, Singularity will use directories in your home directory, and you will exceed the quota on that file system. Also, images tend to be very large, so store them in a directory where you have sufficient quota, e.g., $VSC_DATA.

This approach will serve you well if you can use either prebuilt images or Docker containers. If you need to modify an existing image or container, you should consider the alternatives.

Note: Creating image files may take considerable time and resources. It is good practice to do this on a compute node, rather than on a login node.

Local builds

The most convenient way to create an image is on your own machine, since you will have superuser privileges, and hence the most options to choose from. At this point, Singularity only runs under Linux, so you would have to use a virtual machine when using Windows or macOS. For detailed instructions, see the Singularity installation documentation.


Besides building images from Docker containers, you have the option to create them from a definition file, which allows you to completely customize your image. We provide a brief introduction to Singularity definition files, but for more details, we refer you to the Singularity definition file documentation. When you have a Singularity definition file, e.g., my_image.def, you can build your image file my_image.sif:

your_machine> singularity build my_image.sif my_image.def

Once your image is built, you can transfer it to the VSC infrastructure to use it.

Warning: Since Singularity images can be very large, transfer your image to a directory where you have sufficient quota, e.g., $VSC_DATA.

Remote builds

You can build images on the Singularity website, and download them to the VSC infrastructure. You will have to create an account at Sylabs. Once this is done, you can use Sylabs Remote Builder to create an image based on a Singularity definition. If the build succeeds, you can pull the resulting image from the library:

$ export SINGULARITY_CACHEDIR=$VSC_SCRATCH/singularity_cache
$ mkdir -p $SINGULARITY_CACHEDIR
$ singularity pull library://gjbex/remote-builds/rb-5d6cb2d65192faeb1a3f92c3:latest

Warning: Don’t forget to define and create the $SINGULARITY_CACHEDIR since if you fail to do so, Singu- larity will use directories in your home directory, and you will exceed the quota on that file system. Also, images tend to be very large, so store them in a directory where you have sufficient quota, e.g., $VSC_DATA.

Remote builds have several advantages:
• you only need a web browser to create them, so this approach is platform-independent;
• they can easily be shared with others.
However, local builds still offer more flexibility, especially when some interactive setup is required.

Singularity definition files

Below is an example of a Singularity definition file:

Bootstrap: docker
From: ubuntu:xenial

%post
    apt-get update
    apt-get install -y grace

%runscript
    /usr/bin/grace

The resulting image will be based on the Ubuntu Xenial Xerus distribution (16.04). Once it is bootstrapped, the commands in the %post section of the definition file will be executed. For this example, the Grace plotting package will be installed.


Note: This example is intended to illustrate that very old software that is no longer maintained can successfully be run on modern infrastructure. It is by no means intended to encourage you to start using Grace.

Singularity definition files are very flexible. For more details, we refer you to the Singularity definition file documen- tation. An important advantage of definition files is that they can easily be shared, and improve reproducibility.

How can I run a Singularity image?

Once you have an image, there are several options to run the container. 1. You can invoke any application that is in the $PATH of the container, e.g., for the image containing Grace:

$ singularity exec grace.sif xmgrace

2. In case the definition file specified a %runscript directive, this can be executed using:

$ singularity run grace.sif

3. The container can be run as a shell:

$ singularity shell grace.sif

By default, your home directory in the container will be mounted with the same path as it has on the host. The current working directory in the container is the directory on the host from which you invoked singularity.

Note: Although you can move to a parent directory of the current working directory in the container, you will not see its contents on the host. Only the current working directory and its sub-directories on the host are mounted.

Additional host directories can be mounted in the container as well by using the -B option. Mount points are created dynamically (using overlays), so they do not have to exist in the image. For example, to mount the $VSC_SCRATCH directory, you would use:

$ singularity exec -B $VSC_SCRATCH:/scratch grace.sif xmgrace

Your $VSC_SCRATCH directory is now accessible from within the image in the directory /scratch.

Note: If you want existing scripts to work from within the image without having to change paths, it may be convenient to use identical mount points in the image and on the host, e.g., for the $VSC_DATA directory:

$ singularity exec -B $VSC_DATA:$VSC_DATA grace.sif xmgrace

Or, more concisely:

$ singularity exec -B $VSC_DATA grace.sif xmgrace

The host environment variables are also defined inside the container, hence scripts that use them will work.


Can I use Singularity images in a job?

Yes, you can. Singularity images can be part of any workflow, e.g., the following script would create a plot in the Grace container:

#!/bin/bash -l
#PBS -l nodes=1:ppn=1
#PBS -l walltime=00:30:00

cd $PBS_O_WORKDIR
singularity exec grace.sif gracebat -data data.dat \
    -batch plot.bat

Ensure that the container has access to all the required directories by providing additional bindings if necessary.

Can I run parallel applications using a Singularity image?

For shared memory applications there is absolutely no problem. For distributed applications it is highly recommended to use the same implementation and version of the MPI libraries on the host and in the image. You also want to install the appropriate drivers for the interconnect, as well as the low-level communication libraries, e.g., ibverbs, in the image. For this type of scenario, it is probably best to contact user support.
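A commonly used pattern, shown here only as a hedged sketch (my_mpi_image.sif and my_mpi_app are placeholders, and the exact launcher invocation depends on the MPI stack on the host), is to let the host MPI launcher start one containerized process per rank:

$ mpirun -np 40 singularity exec my_mpi_image.sif ./my_mpi_app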

Note: For distributed applications you may expect some mild performance degradation.

Can I run a service from a Singularity image?

Yes, it is possible to run services such as databases or web applications that are installed in Singularity images. For this type of scenario, it is probably best to contact user support.
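As a hedged illustration (my_service.sif and the instance name are placeholders), Singularity's instance subcommands can be used to start, inspect and stop a long-running service:

$ singularity instance start my_service.sif my_service   # start the service container
$ singularity instance list                              # check that it is running
$ singularity instance stop my_service                   # shut it down again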

