HPC wiki Documentation Release 2.0

Hurng-Chun Lee, Daniel Sharoh, Edward Gerrits, Marek Tyc, Mike van Engelenburg, Mariam Zabihi

Sep 15, 2021

Contents

1 About the wiki

2 Table of Contents
    2.1 High Performance Computing for Neuroimaging Research
    2.2 Linux tutorial
    2.3 Introduction to the Linux shell
    2.4 The HPC cluster
    2.5 The project storage
    2.6 Linux & HPC workshops

CHAPTER 1

About the wiki

This wiki contains materials used by the Linux and HPC workshop held regularly at the Donders Centre for Cognitive Neuroimaging (DCCN). The aim of this workshop is to provide researchers with the basic knowledge to use the High-Performance Computing (HPC) cluster for data analysis. During the workshop, the wiki is used in combination with lectures and hands-on exercises; nevertheless, the contents of the wiki are written in such a way that they can also be used for self-learning and reference. There are two major sessions in this wiki. The Linux basics session covers the usage of the Linux shell and an introduction to Bash scripting. After following this session, you should be able to create text-based data files in a Linux system and write a bash script to perform simple data analysis on the files. The cluster usage session focuses on the general approach of running computations on the Torque/Moab cluster. After following this session, you should know how to distribute data analysis computations to the Torque/Moab cluster at DCCN.


CHAPTER 2

Table of Contents

2.1 High Performance Computing for Neuroimaging Research

Fig. 1: The HPC environment at DCCN.

2.1.1 HPC Cluster

The HPC cluster at DCCN consists of two groups of computers:

• access nodes: mentat001 ~ mentat005, serving as login nodes.
• compute nodes: a pool of powerful computers with more than 1000 CPU cores.

Compute nodes are managed by the Torque job manager and the Moab job scheduler. While the access nodes can be accessed via either a SSH terminal or a VNC session, compute nodes are only accessible by submitting computational jobs.
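For example, a SSH login to one of the access nodes from a Linux or macOS terminal looks like the sketch below; the username honlee and the fully qualified hostname mentat001.dccn.nl are illustrative and may differ in your setup:

$ ssh honlee@mentat001.dccn.nl
# once logged in, computational work is submitted to the compute nodes
# via the Torque/Moab tools (e.g. qsub); it is not run on the login node itself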

2.1.2 Central Storage

The central storage provides a shared file system amongst the Windows desktops within DCCN and the computers in the HPC cluster. On the central storage, every user has a personal folder with a so-called office quota (20 gigabytes by default). This personal folder is referred to as the M:\ drive on the Windows desktops. Storage space granted to research projects (following the project proposal meeting (PPM)) is also provided by the central storage. The project folders are organised under the directory /project, which is referred to as the P:\ drive on the Windows desktops. The central storage also hosts a set of commonly used software/tools for neuroimaging data processing and analysis. This area of the storage is only accessible to computers in the HPC cluster, as the software/tools stored there require the Linux operating system.

3 HPC wiki Documentation, Release 2.0

2.1.3 Identity Manager

The identity manager maintains information for authenticating users accessing the HPC cluster. It is also used to check users' identity when logging into the Windows desktops at DCCN. In fact, the user account received from the DCCN check-in procedure is managed behind the scenes by this identity manager.

Note: The user account concerned here (and throughout the entire wiki) is the one received via the DCCN check-in procedure. It is, in most cases, a combination of the first three letters of your first name and the first three letters of your last name. It is NOT the account (i.e. U/Z/S-number) from Radboud University.

2.1.4 Supported Software

A list of supported software can be found here.

2.2 Linux tutorial

2.2.1 Very short introduction of Linux

Linux is an operating system originally developed by Linus Torvalds in the 90's as a clone of the Unix operating system for personal computers (PCs). It is now one of the world's renowned software projects, developed and managed by the open-source community. With its open nature in software development, free (re-)distribution, and many features inherited directly from Unix, the Linux system provides an ideal and affordable environment for software development and scientific computation. This is why Linux is widely used in most scientific computing systems nowadays.

4 Chapter 2. Table of Contents HPC wiki Documentation, Release 2.0

Architecture

The figure above illustrates a simplified view of the Linux architecture. From the inside out, the core of the system is called the kernel. It interacts with hardware devices, and provides the upper-layer components with low-level functions that hide the complexity of, for example, arranging concurrent accesses to hardware. The shell is an interface to the kernel. It takes commands from the user (or an application) and executes the kernel's functions accordingly. Applications generally refer to system utilities providing advanced functionalities of the operating system, such as the tool cp for copying files.

File and process

Everything in Linux is either a file or a process. A process in Linux refers to an executing program identified by a unique process identifier (PID). Processes are internally managed by the Linux kernel for access to hardware resources (e.g. CPU, memory, etc.). In most cases, a file in Linux is a collection of data. Files are created by users using text editors, running compilers, etc. Hardware devices are also represented as files in Linux.

Linux distributions

Nowadays Linux is made available as a collection of selected software packages based around the Linux kernel, the so-called Linux distribution. As of today, different Linux distributions are available on the market, each addressing the needs of a certain user community. In the HPC cluster at DCCN, we use the CentOS Linux distribution. It is a well-maintained distribution developed closely with RedHat, a company providing commercial Linux distributions and support. It is also widely used in many scientific computing systems around the world.


2.2.2 Getting started with Linux

By following this wiki, you will log in to one of the access nodes of the HPC cluster, learn about the Linux shell, and issue a very simple Linux command on the virtual terminal.

Obtain a user account

Please refer to this guide.

SSH login with Putty

Please refer to this guide.

The prompt of the shell

After you log in to the access node, the first thing you see is a welcome message together with a couple of news messages. Following the messages are a few lines of text that look similar to the example below:

honlee@mentat001:~ 999$

Every logged-in user is given a shell to interact with the system. The example above is technically called the prompt of the Linux shell. It waits for your commands to the system. Following the prompt, you type in commands to run programs.

Note: For simplicity, we will use the symbol $ to denote the prompt of the shell.

Environment variables

Every Linux shell comes with a set of variables that can affect the way running processes will behave. Those variables are called environment variables. The command to list all environment variables in the current shell is

$ env

Tip: The practical action of running the above command is to type env after the shell prompt, and press the Enter key.

Generally speaking, users need to set or modify some default environment variables to get a particular program running properly. A very common case is to adjust the PATH variable to allow the system to find the location of the program's executable when the program is launched by the user. Another example is to extend LD_LIBRARY_PATH to include the directory where the dynamic libraries needed for running a program can be found. In the HPC cluster, a set of environment variables has been prepared for the data analysis software supported in the cluster. Loading (or unloading) these variables in a shell is also made easy using the Environment Modules. For average users, it is not even necessary to load the variables explicitly, as a default set of variables corresponding to commonly used neuroimaging software is loaded automatically upon login. More details about using software in the HPC cluster can be found here.
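As a minimal sketch of what such adjustments look like (the directory paths and the module name below are made up for illustration; the actual module names available on the cluster may differ):

$ export PATH=$PATH:/home/tg/honlee/my_tools/bin                # let the shell find executables in a custom directory
$ export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/tg/honlee/my_tools/lib   # let programs find locally installed dynamic libraries
$ module load somesoftware/1.0                                  # with Environment Modules, the same changes are applied for you (module name is hypothetical)
$ module list                                                   # show which modules are currently loaded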


Knowing who you are in the system

The Linux system is designed to support multiple concurrent users. Every user has an account (i.e. user id), which is the one you used to log in to the access node. Every user account is associated with at least one group in the system. In the HPC cluster at DCCN, the system groups are created to correspond to the research (i.e. PI) groups. User accounts are associated with groups according to the registration made during the check-in procedure. To know your user id and the system group you are associated with, simply type id followed by pressing the Enter key to issue the command at the prompt. For example:

$ id
uid=10343(honlee) gid=601(tg) groups=601(tg)

Using online manuals

A Linux command comes with options for additional functionalities; the online manual provides a handy way to find the supported options of a command. To access the online manual of a command, one uses the command man followed by the command in question. For example, to get all possible options of the id command, one does

$ man id

2.2.3 Understanding the Linux file system

Data and software programs in the Linux system are stored in files organised in directories (i.e. folders). The file system is responsible for managing the files and directories. In this wiki, you will learn about the tree structure of the file system and understand the syntax used to represent the file type and access permission. You will also learn the commands for creating, moving/copying, and deleting files and directories in the file system.

Present working directory

Right after you log in to a Linux system, you are in a certain working directory in the file system. It is the so-called present working directory. Knowing which working directory you are currently in can be done with the command pwd. For example,

$ pwd
/home/tg/honlee

The system responds to this command with a string representing the present working directory in a special notation. This string is referred to as the path to the present working directory. The string /home/tg/honlee from the above example is interpreted as follows: in the Linux file system, directories and files are organised in a tree structure. The root of the tree is denoted by the / symbol, as shown at the beginning of the string. Following that is the first-level child directory called home. It is then separated from the second-level child tg by an additional / symbol. This notation convention repeats while moving down the child-directory levels, until the present working directory is reached. For instance, the present working directory in this example is the third-level child from the root, and it is called honlee. The hierarchy is also illustrated in the diagram below:

/                   <-- the root directory
|-- home            <-- first-level child
|   |-- tg          <-- second-level child
|   |   |-- honlee  <-- the present working directory

Changing the present working directory

With the file path notation, one changes the present working directory in the system using the cd command. Continuing with the above example, if we want to move to the tg directory, we do:

$ cd /home/tg

Since the directory tg is one level up with respect to the present working directory, it can also be referred to by the .. symbol. Therefore, an alternative to the previous command is:

$ cd ..

The difference between the two is that in the first command the directory tg is referred to from the root directory using the so-called absolute path, while in the second it is referred to relatively from the present working directory with the relative path.

Tip: The relative path to the present working directory is denoted by the symbol .

The personal directory

Every user has a personal directory in which the user has full access permission to manage the data stored in it. The absolute path of this directory is referred to by an environment variable called $HOME. Thus, one can always use the following command to change the present working directory to the personal directory.

$ cd $HOME

Tip: One can also leave out $HOME in the above cd command to move to the personal directory.

Listing files in a directory

For listing files and sub-directories in the present working directory, one uses the ls command. For example,

$ ls

The option -l is frequently used to get more information about the files/directories. For example,

$ ls -l
total 68
drwxr-xr-x 2 honlee tg 4096 Aug 12 13:09 Desktop
drwxr-xr-x 2 honlee tg 4096 Aug 21 16:15 matlab
drwx------ 5 honlee tg 4096 Mar  7 14:37 opt
-rw-r--r-- 1 honlee tg   84 Mar  5 10:47 startup.m
-rwxr-xr-x 1 honlee tg  737 Aug 19 12:56 test.sh


File information is provided in columns. They are summarised in the following table:

Column  Example       Information
1       drwxr-xr-x    indicator for file type and access permission
2       2             number of links to the file
3       honlee        user ownership
4       tg            group ownership
5       4096          size of the file in bytes
6-8     Aug 12 13:09  time of the last modification
9       Desktop       name of the file

File type and permission

The indicator for file type and access permission requires an interpretation, shown graphically in the picture below.

The first character presents the type of the file. In most cases, you will see the character d, -, or l, corresponding to a directory, a regular file, or a link, respectively. The file-type character is followed by 9 additional characters organised in three sets, each consisting of three characters representing the read (r), write (w) and execute (x) permissions of the file. If a certain permission is disabled, a - is shown instead. The three sets, from left to right, indicate permissions for the user, the group (i.e. all users in the group), and others (i.e. all other users in the system). The user and group considered here are the user and group ownership (see the third and fourth columns of the table).

Changing file permission

When you are the owner of a file (or you have write permission on it), you can change the file permission. To change the permission, we use the chmod command.


For example, to make a file called test readable for all users in the system, one does

$ chmod o+r test

The syntax o+r stands for adding read permission for others. By replacing the character o with u or g, one adds read permission for the user or the group. Replacing r with w or x sets the write or execute permission instead of read. Using - instead of + removes permissions accordingly.
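A few more chmod invocations, as a sketch of how these pieces combine (the file name test and the permissions shown are just examples):

$ chmod g+w test        # add write permission for the group
$ chmod o-rwx test      # remove all permissions for others
$ chmod a+x test        # add execute permission for user, group and others (a = all)
$ chmod u+rw,g-w test   # several changes can be combined with a comma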

Copying and (re-)moving files

For copying a file, one uses the cp command. Assuming there is a file at path /home/tg/test, to make a copy of it and place the copy at path /home/tg/test.copy, one does

$ cp /home/tg/test /home/tg/test.copy

It requires the -R option to copy a directory. For example, to copy a directory at path /home/tg/test_dir to /home/tg/test_dir.copy, one does

$ cp -R /home/tg/test_dir /home/tg/test_dir.copy

For moving a file/directory from one path to another, one uses the mv command:

$ mv /home/tg/test /home/tg/test.move
$ mv /home/tg/test_dir /home/tg/test_dir.move

To delete (remove) a file from the file system, one uses the rm command:

$ rm /home/tg/test

When deleting a directory from the file system, the directory should be emptied first, i.e. it should not contain any files or sub-directories. The -r option simplifies the deletion of a directory by removing files and sub-directories recursively.
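For instance (the directory name continues the example above; the exact wording of the error message may vary between systems):

$ rm /home/tg/test_dir.copy
rm: cannot remove '/home/tg/test_dir.copy': Is a directory
$ rm -r /home/tg/test_dir.copy    # removes the directory and everything inside it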

Creating new directory

Creating a directory is done using the mkdir command. The following command creates a new directory at the path /home/tg/new_dir.

$ mkdir /home/tg/new_dir

The system assumes that the parent paths (/home and /home/tg) already exist prior to the creation of /home/tg/new_dir. The option -p is used to create the necessary parent directories.
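For example, the following creates a nested directory together with any missing parents in one go (the path is made up for illustration):

$ mkdir -p /home/tg/new_dir/sub1/sub2    # creates new_dir, sub1 and sub2 if they do not exist yet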

Using wildcards

Wildcards are a special syntax for specifying a group of files that have some part of their names in common. Linux commands can use wildcards to perform actions on more than one file at a time. The most used wildcard is the asterisk *, representing any number of characters. In the example below, the wildcard is used to remove files with the prefix subject_ and the suffix .out in the present working directory:


$ ls
subject_1.dat subject_2.dat subject_3.dat subject_4.dat subject_5.dat
subject_1.out subject_2.out subject_3.out subject_4.out subject_5.out

$ rm subject_*.out

$ ls
subject_1.dat subject_2.dat subject_3.dat subject_4.dat subject_5.dat

Tip: More wildcard syntax can be found here.

2.2.4 Working with text files

Given their simplicity and readability, text files are widely used in computing systems for various purposes. In this practice, we will use text files to store numerical data. A benefit of storing data in text files is that many tools coming along with the Linux system can be used directly to process the data. In the examples below, we will create two text files to store the final-exam scores of four students in the mathematics and language courses. We will then introduce a few useful Linux commands to browse and analyse the data. Before we start, make sure the directory $HOME/tutorial/labs is already available; otherwise create it with

$ mkdir -p $HOME/tutorial/labs

and change the present working directory to it:

$ cd $HOME/tutorial/labs

Creating and editing text file

There are many text editors in Linux. Here we use the editor called nano, which is relatively easy to adopt. Let's first create a text file called score_math.dat using the following command:

Note: In Linux, the suffix of the filename is irrelevant to the file type. Use the file command to examine the file type.

$ nano score_math.dat

You will be entering an empty editing area provided by nano. Copy or type the following text into the area:

Thomas 81
Percy 65
Emily 75
James 55

Press Control+o followed by the Enter key to save the file. Press Control+x to quit the editing environment and return to the prompt. Now repeat the steps above to create another file called score_lang.dat, and paste the data below into it.


Thomas 53
Percy 85
Emily 70
James 65

When you list the contents of the present working directory, you should see the two data files.

$ ls -l
total 0
-rw-r--r-- 1 honlee tg 40 Sep 30 15:06 score_lang.dat
-rw-r--r-- 1 honlee tg 37 Sep 30 15:06 score_math.dat

Browsing text file

Several commands can be used to browse a text file. First of all, the command cat can be used to print the entire content to the terminal. For example:

$ cat score_math.dat

When the content is too large to fit in the terminal, one uses either the more or the less command to print the content in pages. For example,

$ more score_math.dat
$ less score_math.dat

Tip: The command less provides more functionalities than the more command such as up/down scrolling and text search.

When only the top or the bottom of the content is of concern, one can use the commands head and tail. To print the first 2 lines, one does

$ head -n2 score_math.dat

To print the last 2 lines, one does

$ tail -n2 score_math.dat

Searching in text file

For searching a string in a text file, one uses the command grep. For example, if we would like to search for the name Thomas in the file score_math.dat, we do

$ grep 'Thomas' score_math.dat

Tip: grep supports advanced pattern searching using regular expressions.
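As a small illustration of such a pattern (just a sketch; the -E option enables the extended regular expression syntax):

$ grep -E '^(Thomas|Emily)' score_math.dat     # match lines that start with either Thomas or Emily
Thomas 81
Emily 75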

2.2.5 Extracting information from data

This practice will continue working on the two data files created in Working with text files. The aim is to show how to extract interesting information out of the data, using some simple but powerful command-line tools of Linux.


You should have the following two files in the present working directory:

$ cat score_math.dat
Thomas 81
Percy 65
Emily 75
James 55

$ cat score_lang.dat
Thomas 53
Percy 85
Emily 70
James 65

Data sorting

If we wonder who has the highest score in the language course, one way to get the answer is to apply the sort command to the text file. For example,

$ sort -k2 -n -r score_lang.dat
Percy 85
Emily 70
James 65
Thomas 53

Here we use the option -k to sort the data on the second column, -n to treat the data as numerical values (instead of text characters, the default), and -r to make the sorting descending. Voilà, Percy has the highest score in the language class.

Data processing

Using awk, a pattern scanning and processing language, one can already perform some statistical calculations on the data without the need for advanced tools such as R. The example below shows a way to calculate the arithmetic mean of the scores in the language class.

$ awk 'BEGIN {cnt=0; sum=0;} {cnt += 1; sum += $2;} END {print "mean:", sum/cnt}' score_lang.dat
mean: 68.25

The example above shows the basic structure of the awk language. It consists of three parts. For the explanation here, we call them the pre-processor, the processor and the post-processor. They are explained below:

• The pre-processor starts with the keyword BEGIN followed by a piece of code enclosed in curly braces (i.e. BEGIN { ... }). It defines what to do before awk starts processing the data file. In the example above, we initialise two variables called cnt and sum for storing the number of students and the sum of the scores, respectively.
• The body of the processor is merely enclosed in curly braces (i.e. { ... }), and it follows right after the pre-processor. The processor defines what to do for each line in the data file. It uses index variables to refer to the data in a specific column of a line. The variable $0 refers to the whole line, and the variables $n to the data in the n-th column. In the example, we simply add 1 to the counter cnt, and increase the sum by the score taken from the 2nd column.
• The post-processor is initiated with the keyword END, with its body enclosed in another pair of curly braces (i.e. END { ... }). Here in the example, we simply calculate the arithmetic mean and print it.
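Following the same three-part structure, a sketch of a slightly different calculation, finding the highest score and the student who obtained it (the variable names max and name are arbitrary):

$ awk 'BEGIN {max=-1; name=""} {if ($2 > max) {max=$2; name=$1}} END {print "highest:", name, max}' score_lang.dat
highest: Percy 85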


Data filtering

One can also use awk to create filters on the data. The example below selects only the students with a score lower than 70.

$ awk '{ if ( $2 < 70 ) print $0}' score_math.dat
Percy 65
James 55

Data processing pipeline

Every running command is treated as a process in the Linux system. Every process is attached to three data streams for receiving data from an input device (e.g. a keyboard), and for printing outputs and errors to an output device (e.g. a screen). These data streams are technically called STDIN, STDOUT and STDERR, standing for the standard input, standard output and standard error, respectively. An important feature of these data streams is that the output stream (e.g. STDOUT) of a process can be connected to the input stream (STDIN) of another process to form a data processing pipeline. The symbol for constructing the pipeline is |, the pipe. In the following example, we assume that we want to make a nice-looking table out of the two score files. The table will list the name of the student, the score for each class, and the total score of the student. First we have to put the data from the two text files together, using the paste command:

$ paste score_lang.dat score_math.dat
Thomas 53    Thomas 81
Percy 85     Percy 65
Emily 70     Emily 75
James 65     James 55

But the output looks ugly! Furthermore, it is only halfway to what we want to have. This is where the process pipeline plays a role. We now revise our command as follows:

$ paste score_lang.dat score_math.dat | awk 'BEGIN{print "name\tlang\tmath\ttotal"; print "---"} {print $1"\t"$2"\t"$4"\t"$2+$4}'
name    lang    math    total
---
Thomas  53      81      134
Percy   85      65      150
Emily   70      75      145
James   65      55      120

Note: In the Linux shell, the string "\t" represents the Tab character. It is a way to align data in columns.

Here the pipeline is constructed such that we first put together the data in the two files using the paste command, and then connect its output stream to the input stream of the awk command to create the nice-looking table.

Saving output to file

When you have processed the data and produced a nice-looking table, it would be a good idea to save the output to a file rather than print it to the screen. Here we will discuss another important feature of the STDOUT and STDERR data streams: output redirection. The following command will produce the nice-looking table again, but instead of printing the table to the terminal, it will be saved to a file called score_table.txt by redirecting the output.


$ paste score_lang.dat score_math.dat | awk 'BEGIN{print "name\tlang\tmath\ttotal"; print "---"} {print $1"\t"$2"\t"$4"\t"$2+$4}' > score_table.txt

Tip: Output redirection with the > symbol will overwrite the content of an existing file. One can use the >> symbol to append new data to an existing file instead.

Note that the above command only redirects the STDOUT stream to a file; data sent to the STDERR stream will still be printed to the terminal. There are two approaches to save the STDERR stream to a file:

1. Merge STDERR into STDOUT

$ paste score_lang.dat score_math.dat | awk 'BEGIN{print "name\tlang\tmath\ttotal"; print "---"} {print $1"\t"$2"\t"$4"\t"$2+$4}' > score_table.txt 2>&1

2. Save STDERR to separate file

$ paste score_lang.dat score_math.dat | awk 'BEGIN{print "name\tlang\tmath\ttotal"; print "---"} {print $1"\t"$2"\t"$4"\t"$2+$4}' 1> score_table.txt 2> score_table.err
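To see the difference between the two streams in a simpler setting, the sketch below runs ls on a file that does not exist (the file names are arbitrary, and the exact wording of the error message may differ between systems); the error ends up in err.txt rather than in out.txt:

$ ls no_such_file.dat 1> out.txt 2> err.txt
$ cat out.txt        # empty: there was nothing on STDOUT
$ cat err.txt
ls: cannot access no_such_file.dat: No such file or directory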

2.2.6 Exercise: file system operations

Note: Please try not to just copy-and-paste the commands provided in the hands-on exercises!! Typing (and eventually making typos) is an essential part of the learning process.

In this exercise, we will get you familiar with the Linux file system. Following the steps below, you will use frequently needed commands to perform operations on the file system, including:

• browsing files and sub-directories within a directory,
• creating and removing directories,
• moving the current working directory between directories,
• changing the access permission of a directory,
• creating and deleting files.

You will also learn a few useful wildcard patterns to get things done quicker and easier.

Tasks

1. Change the present working directory to your personal directory

$ cd $HOME

2. Create a new directory called tutorial

$ mkdir tutorial

3. Change the present working directory to the tutorial directory


$ cd tutorial

4. Create two new directories called labs and exercises

$ mkdir labs
$ mkdir exercises

5. Remove all access permissions of others from the exercises directory

$ chmod o-rwx exercises

6. Set groups to have read and execute permissions on the exercises directory

$ chmod g=rx exercises

7. Change the present working directory to $HOME/tutorial/labs

$ cd $HOME/tutorial/labs

8. Create multiple empty files (and list them) using wildcards. Note the syntax {1..5} in the first command below. It is taken by the Linux shell as a series of sequential integers from 1 to 5.

$ touch subject_{1..5}.dat

$ ls -l subject_*
-rw-r--r-- 1 honlee tg 0 Sep 30 16:24 subject_1.dat
-rw-r--r-- 1 honlee tg 0 Sep 30 16:24 subject_2.dat
-rw-r--r-- 1 honlee tg 0 Sep 30 16:24 subject_3.dat
-rw-r--r-- 1 honlee tg 0 Sep 30 16:24 subject_4.dat
-rw-r--r-- 1 honlee tg 0 Sep 30 16:24 subject_5.dat

Tip: The touch command is used for creating empty files.

9. Remove multiple files using wildcards. Note the syntax *. It is taken as “any characters” by the Linux shell.

$ rm subject_*.dat

2.2.7 Exercise: Familiarize Yourselves with Redirects

The typical shells used in Linux environments allow for redirecting input and output to other commands or to files. The basic redirects you will use today are >, >>, and |. You can generally use these redirects with any standard command-line utility.

Your Task

1. Either make a new directory or go to an existing directory that you made in the previous exercise. Take a few minutes to try each of these three redirects with arbitrary commands to improve your understanding of their functionality.

Hint: Try some commands like those shown below. Experiment with other commands you learned about in the slides this morning, or some of the commands on your cheat sheet. Notice that you can stack redirects multiple times, as in the first example.


$ ls /home | sort > file.txt
$ echo hello > file.txt
$ echo hello >> file.txt

2.2.8 Exercise: Using Wildcards

Note: Please try not to just copy-and-paste the commands provided in the hands-on exercises!! Typing (and eventually making typos) is an essential part of the learning process.

Preparation

Move into a directory you’d like to work in (make a new directory if you like), and run the command

$ touch gcutError_recon-all.log s10_recon-all.log s1_recon-all.log s6_recon-all.log s8_recon-all.log

This will create empty files for the purpose of this exercise.

Background

A handy way to refer to many items with a similar pattern is with wildcards. These have been described so far in the lectures, and mainly consist of the characters:

• * matches everything
• ? matches any single character
• [] matches any of the letters or numbers, or a range of letters or numbers, inside the brackets

With BASH, the shell itself expands the wildcards. This means that commands usually don't see these special characters, because BASH has already expanded them before the command is run. Try to get a feel for wildcards with the following examples.

$ ls *recon-all.log
gcutError_recon-all.log s10_recon-all.log s1_recon-all.log s6_recon-all.log s8_recon-all.log

$ ls gcut*
gcutError_recon-all.log

$ ls s[0-9]*
s10_recon-all.log s1_recon-all.log s6_recon-all.log s8_recon-all.log

$ ls s[0-9]_*
s1_recon-all.log s6_recon-all.log s8_recon-all.log

$ ls s[0-9][0-9]_*
s10_recon-all.log

$ ls [a-z][0-9][0-9]???con-all.log
s10_recon-all.log

$ ls s?_recon-all.log
s1_recon-all.log s6_recon-all.log s8_recon-all.log

Do you understand all of the patterns and how they returned what they did? The [] wildcard has the most complex syntax because it is more flexible. When BASH sees the [] characters, it will try to match any of the characters, or a range of characters, it sees inside them. A range of characters is specified by separating two search characters with the - character. Some legal patterns would be [0-9], [5-8], [a-Z], or [ady1-3]. Another handy trick is to use the ! character to negate a search pattern inside []. For instance, [!0-9] means: don't return anything with a value between 0 and 9. Take a look at the next examples to get a feel for this very useful globbing character.

• matching all strings starting with s1, followed by any of the numbers from 0 to 9, followed then by anything:

$ ls s1[0-9]*
s10_recon-all.log

• matching all strings starting with any of a range of letters from a to Z

$ ls [a-Z]*
gcutError_recon-all.log s10_recon-all.log s1_recon-all.log s6_recon-all.log s8_recon-all.log

• matching all strings starting with s, g, or 0.

$ ls [sg0]*

• matching all strings that do not start with s

$ ls [!s]*
gcutError_recon-all.log

Your Task

1. Find a search pattern that will return all files ending in .txt
2. Find a search pattern that will return all files starting with s and ending in .log
3. Find a search pattern that will return all files starting with s followed by two numbers
4. Find a search pattern that will return all files starting with s followed by only one number

Solution

1. ls *.txt
2. ls s*.log
3. ls s[0-9][0-9]*
4. ls s[0-9][!0-9]*


Clean up

When you're finished and have checked the solution, run the command below to remove the files we were working with. If you don't do this, the next exercise will give you trouble.

$ rm gcutError_recon-all.log s10_recon-all.log s1_recon-all.log s6_recon-all.log s8_recon-all.log

2.2.9 Exercise: play with text-based data file

Note: Please try not to just copy-and-paste the commands provided in the hands-on exercises!! Typing (and eventually making typos) is an essential part of the learning process.

Preparation

Download this data file using the following command:

$ wget https://raw.githubusercontent.com/Donders-Institute/hpc-wiki-v2/master/docs/linux/exercise/gcutError_recon-all.log

This data file is an example output file from a freesurfer command submitted to the cluster using qsub. In this simple task we are going to try to extract some information from it using a few commands.

Your Task

1. Construct a Linux command pipeline to get the subject ID associated with the log file. The subject ID is of the form Subject##, i.e. Subject01, Subject02, Subject03, etc. Use one command to send input to grep, and then use grep to search for a pattern. If you're a bit confused, take a look at the hints and the example grep command below. You'll have to modify it to get the result you want.

Hint:
• Commands separated with a pipe, the | character, send the output of the command on the left of the pipe as input to the command on the right of the pipe.
• Think back on the exercise about wildcards. grep uses something called regular expressions, which are similar to wildcards but much more extensive. For grep regexps, * and [] work the same way as they do in wildcards. For a fuller treatment of regexps, click here. For a quick example, see below. You can grep for a search term in a file with something like the following:

# example grep command
$ cat file.txt | grep SEARCHTERM
# where SEARCHTERM can be something like
$ cat file.txt | grep "[0-9][0-9].*"
# this search term would find matches in strings that start with two numbers followed by anything

2. If you completed Task 1, you were able to find the output you wanted, but there was much more output sent to the screen than you needed. Construct another pipeline to limit the output of grep to only the first line.


Hint: Think of a command that prints the first n lines of a file. You can always google the task if you can’t think of the right tool for the job.

Solution

Solution to Task 1

$ cat gcutError_recon-all.log | grep "Subject[0-9][0-9]"
/home/language/dansha/Studies/LaminarWord/SubjectData/Subject05/FreeSurfer
-subjid FreeSurfer -i /home/language/dansha/Studies/LaminarWord/SubjectData/Subject05/Scans/Anatomical/MP2RAGE/MP2RAGE.nii -all
setenv SUBJECTS_DIR /home/language/dansha/Studies/LaminarWord/SubjectData/Subject05
/home/language/dansha/Studies/LaminarWord/SubjectData/Subject05/FreeSurfer
mri_convert /home/language/dansha/Studies/LaminarWord/SubjectData/Subject05/Scans/Anatomical/MP2RAGE/MP2RAGE.nii /home/language/dansha/Studies/LaminarWord/SubjectData/Subject05/FreeSurfer/mri/orig/001.mgz
mri_convert /home/language/dansha/Studies/LaminarWord/SubjectData/Subject05/Scans/Anatomical/MP2RAGE/MP2RAGE.nii /home/language/dansha/Studies/LaminarWord/SubjectData/Subject05/FreeSurfer/mri/orig/001.mgz
reading from /home/language/dansha/Studies/LaminarWord/SubjectData/Subject05/Scans/Anatomical/MP2RAGE/MP2RAGE.nii...
writing to /home/language/dansha/Studies/LaminarWord/SubjectData/Subject05/FreeSurfer/mri/orig/001.mgz...
/home/language/dansha/Studies/LaminarWord/SubjectData/Subject05/FreeSurfer/mri/orig/001.mgz
cp /home/language/dansha/Studies/LaminarWord/SubjectData/Subject05/FreeSurfer/mri/orig/001.mgz /home/language/dansha/Studies/LaminarWord/SubjectData/Subject05/FreeSurfer/mri/rawavg.mgz

Hint: Note that you could also have run the command

$ grep "Subject[0-9][0-9]" gcutError_recon-all.log

to get the same results. The traditional unix command line tools typically provide many ways of doing the same thing. It's up to the user to find the best way to accomplish each task. grep is an excellent tool. To learn more about what you can search, try man grep. You can also google for something like "cool stuff I can do with grep."

Solution to Task 2

$ grep "Subject[0-9][0-9]" gcutError_recon-all.log | head -1

You could have also done

$ grep -m1 "Subject[0-9][0-9]" gcutError_recon-all.log
$ cat gcutError_recon-all.log | grep "Subject[0-9][0-9]" | head -1
$ cat gcutError_recon-all.log | grep -m1 "Subject[0-9][0-9]"

There are usually many ways to do the same thing. Look up the -m option in the grep man page if you’re curious!


Closing Remarks

These are just simple examples. You see the real power of the unix command line tools when you add a little, soon to come, scripting know-how. A simple example of a more powerful way to use grep is in a case where you have 543 subject logs (not impossible!), and you need to search through all of them for subjects who participated in a version of your experiment with a bad stimuli list. grep is an excellent tool for this!

2.3 Introduction to the Linux BASH shell

2.3.1 Get started with bash script

A great feature of the Linux shell is its programming capability. This feature makes it feasible to manage complex computations. This session focuses on the basics of bash scripting. You will learn how to compose a simple bash script, make the script executable, and run it as a shell command.

The first script in action

Follow the steps below to write our first bash script, and put it in action. • Change present working directory to $HOME/tutorial/libs

$ cd $HOME/tutorial/libs

• Create a new text file called hello_me.sh

$ nano hello_me.sh

• Save the following text into the file

 1  #!/bin/bash
 2
 3  # The -n option of the echo command does not print the newline character at the end,
 4  # making the output from the next command show up on the same line.
 5  echo -n "Hello! "
 6
 7  # Just run a system command and let the output be printed to the screen.
 8  whoami
 9
10  # Here we capture the output of the command "/bin/hostname",
11  # assigning it to a new variable called "server".
12  server=$(/bin/hostname)
13
14  # Here we compose a text message and assign it to another variable called "msg".
15  msg="Welcome to $server"
16
17  # Print the value of the variable "msg" to the terminal.
18  echo $msg

• Change the file permission to executable

$ chmod a+x hello_me.sh

• Run the script as a command-line tool


$ ./hello_me.sh

Note: In addition to just typing the script name in the terminal, we add ./ in front. This forces the system to load the executable (i.e. the script) right from the present working directory.

Interpreter directive

Generally speaking, a script is essentially a text file starting with an interpreter directive. The interpreter directive specifies which interpreter program should be used to translate the file contents into instructions executed by the system. The directive always starts with #! (a number sign and an exclamation mark) followed by the path to the executable of the interpreter. Since we are going to use the interpreter of the bash shell, the executable is /bin/bash.
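The same mechanism works for any interpreter. As a sketch (assuming the interpreters are installed at these conventional paths), the first directive below is the one used throughout this wiki, while the others show what scripts for other interpreters typically start with:

#!/bin/bash        <-- a bash script (used throughout this wiki)
#!/bin/sh          <-- a script for the plain POSIX shell
#!/usr/bin/python  <-- a script interpreted by Python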

Comments

Except for the first line, which is reserved for the interpreter directive, text following a # (number sign) on a line is treated as a comment. Comments are ignored by the interpreter while executing the script. In BASH, there is no special syntax for block comments.
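A commonly used workaround for the missing block-comment syntax is to feed a quoted here-document to the null command :, which simply discards its input; this is an idiom rather than an official comment syntax, so use it with care:

: <<'END_COMMENT'
Everything between the two markers is read by the shell
but passed to the ":" command, which ignores its input,
so in practice it behaves like a block comment.
END_COMMENT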

Shell commands

Running shell commands via a script is as simple as typing the commands into the text file, just as they are typed in the terminal. A trivial example is shown on line 8, where the command whoami is called to get the user id.

Variables

Variables are used to store data in the script. This is done by assigning a value to a variable. Two different ways are shown in the example script:

1. The first way is shown on line 12, where the variable server is given a value captured from the output of the /bin/hostname command. For capturing the command output, the command is enclosed in parentheses () following the dollar sign $.
2. The second way, shown on line 15, is simply assigning a string to the variable msg.

Note: When assigning a value to a variable, there SHOULD NOT be any space characters around the equal sign =.
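A quick sketch of what goes wrong when spaces sneak in (the exact error text may differ between systems): with spaces, bash treats the variable name as a command.

$ count = 1
bash: count: command not found
$ count=1
$ echo $count
1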

Tip: Environment variables are also accessible in the script. For example, one can use $HOME in the script to get the path to the personal directory on the file system.

Note: BASH variables are type-free, meaning that you can store any type of data, such as a string, a number or an array of data, in a variable without declaring its type in advance. This feature results in speedy coding and enables flexibility in recycling variable names; but it can also lead to conflict/confusion at some point. Keep this feature in mind when writing complex code.


2.3.2 Useful information

cheat sheet

A PDF version of the cheatsheet can be downloaded from here.

key-combinations in terminal

Note: These key combinations will not work with all terminal applications (i.e. nano, etc.) because specific programs may have the key combinations already assigned to another purpose. In other cases, the terminal program itself may not interpret these characters in the typical way.

The ^ character indicates the Control button. When you see it next to another character, it means to hold down the Ctrl button while you push that character. For example, ^c means to hold down Ctrl and then press the c button while you are holding down Control. In the case of ^shift+c it means to hold down the Control AND Shift buttons while pushing the c button.

key-combination  function
^shift+c         copy highlighted text in the terminal. Highlight text by clicking and dragging, just like in any application.
^shift+v         paste text into the terminal. Text copied from the terminal will be available in other applications using the typical ^v key combination.
^c               send the SIGINT signal to a program. Will usually quit any process currently running in the terminal. It will not quit certain programs, like nano, but it will by default terminate a running script.
^a               move the cursor to the beginning of the line in the terminal
^e               move the cursor to the end of the line in the terminal
^k               delete everything after the cursor on one line

The rest of these aren’t as important, but may still be useful to you:

key-combination  function
^w               delete one word backward from the cursor
^b               move the cursor one character backward
^f               move the cursor one character forward
Alt-f            (hold down the Alt button and then press f) move the cursor one word forward
Alt-b            move the cursor one word backward

Handy commands

The following cd commands help you to move around in the Linux filesystem:

command      function
cd -         change dir to the previous directory you were just in
cd ../       change dir to one directory back; you can move as many directories back with this syntax as you like
cd ../../Dir change dir to two directories back and one directory forward into the directory Dir
cd ~         change dir to the home directory


Changing the PATH variable

At a BASH prompt, type:

$ PATH=$PATH:/path/to/new/directory/

You can add as many directories as you like. If you want to add more, the syntax would be

$ PATH=$PATH:/path/to/first/directory/:/path/to/second/directory/:/and/so/on/

Note: If you find that none of your commands are found after you tried to change PATH, then you have accidentally deleted your PATH variable. Restart bash (reopen the terminal application) and it will go back to normal.

Changing the $HOME/.bashrc

First, it is a good idea to back up the file if you plan to make changes.

$ cp ~/.bashrc ~/.bashrc.bak

Then you can open the bashrc file to modify with the command:

$ nano ~/.bashrc

You will then see a minimal bashrc file that the TG has configured for every user. Add whatever commands you would like to this file. A common thing to do is to alter the PATH variable to contain a directory with your personal scripts. To do this, you just add something like the following to the bottom. Note that you could enter the commands wherever you want in the bashrc; just keep in mind that they will be executed sequentially.

$ PATH=$PATH:/usr/local/abin/:/usr/local/bin/mricron_lx/:/sbin/:/usr/local/bin/:/usr/local/Scripts/

Of course, you’ll have to enter in your own directories for the PATH to make sense for you. There is no sense in copying and pasting these example PATHS. Like on the command line, you can add as many directories as you want, just remember to separate them with the : character. When you are finished modifying the file. Press ^x to exit, and nano will ask you if you want to save. Say yes. To have the current bash environment use the new bashrc, you can either start a new instance of bash, or run the command

$ source ~/.bashrc

The source command just means to run the file as though you were typing in each command yourself, and not in a new bash instance (the behaviour for scripts). If we were to run the bashrc like a script, any variables we set in the bashrc would not affect the parent environment.

Note: bashrc is a hidden file. It has a . character in front of it. This means that it will not be visible normally. You would need to run the command ls -a to see it in the output.
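A small sketch of what additions to ~/.bashrc might look like (the directory and the alias are examples only, not part of the TG-provided configuration):

# add a personal scripts directory to the PATH
PATH=$PATH:$HOME/Scripts/

# define a convenient alias for a long listing
alias ll='ls -l'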


When to Use Quotes and Which Quotes to Use

Quoting in bash is used to force bash to interpret characters inside the quotes literally. Often, quotes are used to avoid bash treating spaces as delimiting characters. There are two types of quotes in bash.

Double quotes escape spaces, globbing characters, single quotes, and block the expansion of the tilde and {}. Double quotes do not escape the $ character, so variable names are expanded normally. For example, if you need to escape spaces but still want bash to expand variable names, you should use double quotes:

$ file="a file with spaces.txt"; cp "$file" aFileWithoutSpaces.txt

Single quotes escape everything. Use these if you want bash to ignore all special characters. In single quotes, variables won't be expanded. Single quotes are commonly used when quoting search patterns used for grep or awk. This can be because some bash special characters overlap with the grep regular expression characters and cause problems, or because you want to grep for a pattern that double quotes would expand. Consider the following:

$ echo 'Users should set their $PATH variable' >> README; cat README | grep '$PATH'

If we want to grep for the string $PATH, then we are forced to use single quotes to stop the shell from treating the $ character as special. There are many other use cases for both single and double quotes.

You can escape individual characters with the \ character. This works within double quotes as well. If, for example, you wanted to have a string with two $ characters where one $ is escaped and one $ is interpreted normally, then you can use double quotes with a \ preceding the $ you would like to escape.

$ echo "$PATH \$PATH" > file.txt

This code will echo both the expanded $PATH variable and the string $PATH to a file called file.txt.
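A compact sketch contrasting the two quoting styles (the variable name is arbitrary; the output shown is what bash prints in a default setup):

$ name="world"
$ echo "hello $name"     # double quotes: the variable is expanded
hello world
$ echo 'hello $name'     # single quotes: everything is literal
hello $name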

Process control (killing hung jobs)

If a process you are running, whether in the GUI or on the command line, becomes unresponsive and you cannot kill it by conventional means, you can use the kill command. First find the process ID that you want to stop. The following command will list all the processes being run by your username.

$ ps ux

For example,

$ ps ux
USER     PID %CPU %MEM     VSZ    RSS TTY     STAT START   TIME COMMAND
dansha  4244  0.0  0.0  162256   3604 ?       Ss   Oct11   0:00 xterm
dansha  4246  0.0  0.0  131076   3372 pts/0   Ss   Oct11   0:00 bash
dansha  4342  4.6  0.1  578252  27800 ?       Rl   11:54   0:00 konsole
dansha  4346  1.0  0.0  131076   3320 pts/12  Ss   11:54   0:00 /bin/bash
dansha  4369  0.0  0.0  578492  16148 pts/0   Sl+  Oct11   0:01 xfce4-terminal
dansha  4375  0.0  0.0   22980    896 pts/0   S+   Oct11   0:00 gnome-pty-helper
dansha  4376  0.0  0.0  131084   3332 pts/3   Ss+  Oct11   0:00 bash
dansha  4474  0.0  0.0  133648   1388 pts/12  R+   11:54   0:00 ps ux
dansha  4729  0.0  0.0  131084   3336 pts/7   Ss+  Oct11   0:00 bash
dansha  4920  0.0  0.0  131084   3392 pts/8   Ss+  Oct11   0:00 bash
dansha  5104  0.0  0.0  162256   3604 ?       Ss   Oct11   0:00 xterm
dansha  5106  0.0  0.0  131076   3256 pts/11  Ss+  Oct11   0:00 bash
dansha  5617  0.0  0.0  162256   3804 ?       Ss   Oct06   0:00 xterm
dansha  5619  0.0  0.0  131176   3568 pts/17  Ss+  Oct06   0:00 bash
dansha  5711  0.0  0.0  376040    404 ?       Ss   Aug31   0:00 emacs -daemon
dansha  7505  0.0  0.0  367324        ?       Ss   May20   0:00 /bin/dbus-daemon --fork --print-pid 6 --print-address 8 --session
dansha  9568  0.0  0.0  433608   8796 ?       Sl   Oct09   0:00 /usr/libexec/tracker-store
dansha  9572  0.0  0.0  304444   3132 ?       Sl   Oct09   0:00 /usr/libexec/gvfsd
dansha  9576  0.0  0.0  286896   5344 ?       Sl   Oct09   0:00 /usr/libexec//gvfsd-fuse /run/user/10441/gvfs -f -o big_writes
dansha 12361  0.0  0.0  143436   2244 ?       S    Oct07   0:00 sshd: dansha@notty
dansha 12362  0.0  0.0   62932   1912 ?       Ss   Oct07   0:00 /usr/libexec/openssh/sftp-server
dansha 12472  0.0  0.0  143568   2244 ?       S    Oct07   0:00 sshd: dansha@notty
dansha 12473  0.0  0.0   69328   2148 ?       Ss   Oct07   0:00 /usr/libexec/openssh/sftp-server
dansha 15633  0.0  0.0  143568   2436 ?       S    Oct07   0:00 sshd: dansha@pts/10,pts/15
dansha 15634  0.0  0.0  129872   2116 pts/10  Ss+  Oct07   0:00 /bin/sh
dansha 16263  0.0  0.0  128944   3076 pts/15  Ss+  Oct07   0:00 /bin/bash --noediting -i
dansha 18069  0.0  0.6  275020 101536 ?       Sl   Oct04   5:24 /usr/bin/Xvnc :2 -desktop mentat208.dccn.nl:2(dansha) -auth /home/language/dansha/.Xauthority -geometry 1910x10
dansha 18078  0.0  0.0  115184   1540 ?       S    Oct04   0:00 /bin/bash /home/language/dansha/.vnc/xstartup
dansha 18142  0.0  0.0   96760   4120 ?       S    Oct04   0:00 vncconfig -iconic -sendprimary=0 -nowin
dansha 18143  0.0  0.0  159188   6988 ?       S    Oct04   0:06 fluxbox
dansha 18284  1.0  1.9 1461168 318744 ?       Ssl  Oct04 112:48 /usr/lib64/firefox/firefox
dansha 18313  0.0  0.0   28504    768 ?       S    Oct04   0:00 dbus-launch --autolaunch=d172390f877044d1a0919ebec6673565 --binary-syntax --close-stderr
dansha 18314  0.0  0.0   37012    896 ?       Ss   Oct04   0:00 /bin/dbus-daemon --fork --print-pid 6 --print-address 8 --session
dansha 18341  0.0  0.0  160184   2560 ?       S    Oct04   0:01 /usr/libexec/gconfd-2
dansha 30537  0.0  0.0  406336   2536 ?       Sl   Sep22   0:15 /usr/bin/pulseaudio --start --log-target=syslog

The idea is to match the process ID (PID) with the command name. Any command you run (clicking on an icon is also a command) will have an entry in this table if the command created a process that is still running. For example, to kill the firefox process with PID 18284, one uses the command:

$ kill 18284

If firefox still doesn’t close, one could try

$ kill -9 18284

Note: kill -9 is kind of a nuclear option. Don’t use it unless the program won’t close normally with kill.

One could also combine the ps command with grep to find a running process. For example, to find firefox processes, one does:

$ ps ux | grep firefox
dansha  4638  0.0  0.0  114708    984 pts/12  S+   11:56   0:00 grep --color=auto firefox
dansha 18284  1.0  1.9 1461168 318744 ?       Ssl  Oct04 112:48 /usr/lib64/firefox/firefox

Be careful to enter the right PID. If you enter the wrong PID, the command will kill that program instead. Think of this as being like ending the wrong process in the Windows task manager.
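If you prefer not to read the PID off the ps table yourself, the pgrep and pkill utilities (available on most Linux systems) can look processes up by name; a sketch, using firefox from the example above:

$ pgrep -u $USER firefox    # print the PID(s) of your own firefox processes
18284
$ pkill -u $USER firefox    # send SIGTERM to those processes directly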

Tip:
1. If you want to save your work in nano without closing the program, press ^o.
2. To read text files without editing them, use the program less. You can search through documents by typing / and then entering the search term you want to look up. Don't include spaces. You can use this same method to navigate man pages.
3. To see if a program is on your path and where that program is on your path, use the command which.

Odd things to be aware of

These are some little things that have come up with users in the past. I may add more items to this in the future, but these topics are already pretty well addressed on forums.

1. In some terminal programs, accidentally pushing ^s will cause the terminal to lock up. If you notice your terminal is locked up and you're not sure why, try pushing ^q.
2. Sometimes terminal formatting can get messed up. You may notice that when you type long lines, new characters overwrite characters at the beginning of the line. Also, if you accidentally run cat on a binary file, you may notice your terminal starts displaying nonsense characters when you type. In both of these cases, you might try to run the command:

$ reset

Tip: You may not be able to see what you type, but if you hit enter, type the command, and then hit enter again you might get your terminal back to normal. If that doesn’t work, restart the terminal application.

2.3.3 Exercise: Putting Commands into a Script, and Setting the Script as Executable

Note: DO NOT just copy-and-paste the commands for the hands-on exercises!! Typing (and eventually making typos) is an essential part of the learning process.

Preparation

Download a text file using the following command:

$ rm -f gcutError_recon-all.log
$ wget https://github.com/Donders-Institute/hpc-wiki-v2/raw/master/docs/linux/exercise/gcutError_recon-all.log


This data file is an example output file from a freesurfer command submitted to the cluster using qsub. In this simple task we are going to try to extract some information from it using a few commands.

Task

In This task, we’re going to create a script, set it as executable (make it so we can run it), and put it on the path 1. Make a directory called ~/Scripts. If you can’t remember the command to do this, google for it.

Hint: Remember that ~ refers to your home directory.

2. We’re going to start making a scrpt that you will build on in the next exercise. Since a script is really just a text file, open a text editor and then enter the following lines.

#!/bin/bash

# Lines beginning with # are comments. These are not processed by BASH, except in one special case.
# At the beginning of a script, the first line is special.
# It tells Linux what interpreter to use, and is called the interpreter directive.
# If someone tries to execute a BASH script that does not have the #!/bin/bash line,
# and they are using a different shell (tcsh, ksh, etc), then the script
# will probably not work correctly.
# This is because different shells use different syntax.
# The syntax of the interpreter directive is a #! followed immediately by the absolute path of the interpreter you'd like to use.
# In most GNU/Linux systems, BASH is expected to live in the /bin folder, so its full path is normally /bin/bash.

This is the beginning of every BASH script, with some useful commentary added. Comments in BASH are marked with the pound (#) sign.

3. So far this script will do nothing if run, because it only contains an interpreter directive and commentary. We are going to add some commands to the script to make it do something. Recall the previous exercise where you grep'd over the log file. If we want to save those commands to use again, a script is a very good way to do that. Add the following commands to your script following the commentary:

$ cat gcutError_recon-all.log | grep "Subject[0-9][0-9]" | head -1

4. Save this file as ~/Scripts/logSearch.sh
5. Set the script as executable with the following command

$ chmod +x ~/Scripts/logSearch.sh

Note: This step is extremely important. Your script will not run unless you tell Linux that it can be run. This is a security thing. In the chmod (change mode) command, +x is an option meaning "plus executable," i.e. set this file to have permission to execute for all users. For more (and potentially confusing) information, run the command


$ man chmod

6. Next we will show how you can run your script. In Linux, executable files are treated fairly similarly whether they are scripts or binary programs. To run an executable, you generally need to type its name in, and it will execute. You only need to make sure BASH knows where to look for the executable you want to run. There are different ways to do so:
• You can run the executable by typing in the full (absolute) path of the script.
• You can use the path relative to your current working directory.
• You can add the location of the executable to your $PATH environment variable.
Try to run your script by first using the relative path, then the absolute path. Raise your hand if you don't understand this instruction.

Hint: The character . refers to your current directory. In BASH, you need to indicate that you want to run an executable in your current directory by prefacing the command with ./ For example, if you want to execute a script myscript.sh in your current directory, you would type ./myscript.sh.

7. Now that you’ve run your script using the absolute and relative paths, try to add ~/Scripts to your $PATH environment variable.

Hint:
• Check out this useful information
• Remember that you need to add directories to your path, not files. When you type a command and hit enter, BASH will search all the directories on your path for a file matching what you typed. Do not add files directly to your path; BASH will not be able to find them.
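For example, a minimal sketch of adding ~/Scripts to your PATH for the current session, and appending the same line to ~/.bashrc so it survives new logins (the exact layout of your shell startup files is an assumption):

$ export PATH="${HOME}/Scripts:${PATH}"
$ echo 'export PATH="${HOME}/Scripts:${PATH}"' >> ~/.bashrc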

8. See that you can run the script just by typing its name now! WOW!! When an executable file is on your path, you can just type its name without giving any information about its location in the file system. If you specify the path of a file in the command, i.e. by prepending a ./ or /the/path/to/file to the file name, BASH will ignore your path variable and look in the location you specify. The takeaway from all this is that instead of typing

$ cat gcutError_recon-all.log | grep "Subject[0-9][0-9]" | head -1

Every time you want to run this command, you can just run the script you made in this exercise. As you might be thinking already, you can add as many lines as you want to a script. If you open the script back up with your favorite text editor, you can add anything you want to extend its functionality.

2.3.4 Exercise: the if-statement and for-loop

Introduction

In this exercise we will be extending script.sh by adding some BASH flow control constructions. We will be working with if and for. The if-statement and the for-loop are probably the most common flow control tools in the bash hacker's toolbox. The overall goal of this lengthy exercise is to introduce for-loops and if-statements and show you a common use case in which they are put together with what you have learned in previous sessions for actual work. As an example, we will show you how to search for a specific pattern in a collection of log files and print out certain information from the log files given a condition. The exercise consists of two main sections broken into subtasks. The two main sections focus respectively on if and for, and the subtasks are designed to introduce these tools and illustrate their utility.

Task 1: simple for loop

Background

We will construct a simple for-loop to demonstrate how it works. The for-loop works by iterating over a list of items and executing all the commands in the body once for each item in the list. The general form is:

for variable-name in list-of-stuff; do
    commands
    more commands
done

You can add as many commands as you like. BASH will loop through the commands in the body of the loop as many times as there are items in your list. You can see the wiki for more information.

Your Task

1. Add a list of items to this for-loop and see what happens. A list can be a list of files, strings, numbers, anything.

for i in INSERT-LIST-HERE; do
    echo $i
done

Replace INSERT-LIST-HERE with $(ls ${HOME}) and see how i changes to the next item on the list each time the loop iterates.
2. In this next one, try to add any command you want to the body of the for-loop

for i in {01..10}; do
    INSERT-COMMANDS-HERE
    INSERT-MORE-COMMANDS-HERE-IF-YOU-LIKE
done

Tip: Bash takes a range of items within {} and expands it before running any commands. For example, {01..05} will expand to 01 02 03 04 05. You can use letters or numbers. See this link for more information.
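To see the expansion by itself, you can, for example, echo it:

$ echo {01..05}
01 02 03 04 05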

The main things to remember are that the variable name, list and commands are totally arbitrary and can be whatever you like, as long as you keep the correct syntax. You can have as many items in the list as you want, set the variable name to whatever you want, and use any commands you want. You don't even need to reference the variable in the body. For example, try running


for i in {01..05}; do
    echo 'hello world!'
done

Hint: Notice the syntax. The first line ends in do, the next commands are indented, and done, the keyword which ends the loop, is at the same indentation level as the keyword for, which begins the loop. This is how all your for loops should look.

Task 2: Use the for loop in a BASH script

Background

We will extend the functionality of our current script with the for-loop. For this exercise, we deal with the common scenario of needing to search through a collection of log files for specific information.

Preparation

Start by downloading the log files we’ll be using. Move into a directory you’d like to work in and run this command to download and untar the logfiles.

$ wget https://github.com/Donders-Institute/hpc-wiki-v2/raw/master/docs/bash/logs.tgz
$ tar xvf logs.tgz

Now open script.sh and change your grep command to the one you see below. The -o option tells grep to print ONLY the matching pattern, and not the rest of the line around it. This will be useful later in the task and in general.

#!/bin/bash

# Lines beginning with # are comments. These are not processed by BASH, except in one special case.
# At the beginning of a script, the first line is special. It tells Linux what
# interpreter to use, and is called, accordingly, the interpreter directive.

grep -o "Subject[0-9][0-9]" gcutError_recon-all.log | head -1

Your task

Using this command as a starting point, create a for-loop to grep the Subject ID of every log file we've downloaded. To accomplish this goal you will need to do the following:
1. Create a for loop which iterates over a list consisting of the log files.
2. Modify the grep command to search through the current log file and not "gcutError_recon-all.log".
3. Run your script.
The structure will be something like this (a filled-in sketch follows the note below):

for var in list-of-logs; do
    grep -o search-term file-to-search | head -1
done


Note: Always remember to include all the special keywords: for, in, ;, do, and done. If you don't remember these, you might not get an error, but your loop definitely won't run.
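For reference, one possible shape of the finished loop (assuming, as in the solutions later in this section, that the downloaded log files end in log):

for file in *log; do
    grep -o "Subject[0-9][0-9]" $file | head -1
done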

Task 3: simple if statement

Background

Often in programming, you want your program or script to do something if certain conditions are met, and other things if the conditions are not met. In BASH, as well as many other languages, a very common way of exerting this type of control over your program is an if-statement. The purpose of if is to test whether a command returns an exit status of 0 (zero) or not, and then run some commands if the exit status is 0. You can also say to run commands if the exit status is not 0; this is what the keyword else means. Recall that, in BASH, the if-statement syntax is

if command-returns-true; then
    run these commands
else
    run-these-commands-instead
fi

true means exit status 0 (BASH tracks every process' exit status), and the else portion is optional. Any non-zero exit status is not true, i.e. false.
Note: For the gory details, refer back to the slides, the wiki, or suffer the agony of this fairly exhaustive treatment.

Your task

1. Modify the following if-statement code using the command true.

if INSERT-COMMAND-TO-EVALUATE; then
    INSERT-COMMANDS-TO-RUN-IF-TRUE
    INSERT-MORE-COMMANDS-TO-RUN-IF-TRUE
else
    INSERT-COMMANDS-TO-RUN-IF-FALSE
    INSERT-MORE-COMMANDS-TO-RUN-IF-FALSE
fi

Tip: true is a command which does nothing except return exit status 0, thus it always evaluates to true! The description in the man page is good for a chuckle. You’ll want to make sure you put true as the command to evaluate. Remember to fill in the other commands too. The other commands can be whatever you like.
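A filled-in version of the template might, for example, look like this (the echo messages are arbitrary):

if true; then
    echo "the condition returned exit status 0"
else
    echo "this branch is never reached"
fi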

2. Now try using the command false instead of true.

Note: Now the else portion of the code will be evaluated while the part before the else keyword will not be evaluated. Use the same template if-statement as you did in subtask 1.


Task 4: Comparative statements

Background

In this task, you will extend the power of if by using it with comparison operators of BASH. Task 3 demonstrated how if-statements work, but their main use in scripting is testing if a comparison evaluates to true or false. This complicates the syntax. For comparisons, you need to use a separate command called test. In BASH, the most commonly seen form of test is [[ things-to-compare ]].

Tip: You will also see the form [ things-to-compare ], which is simply a less featured version of [[ ]]. They are both versions of the command test. In general, you should always use the [[ ]] form. You can look to this guide for a good explanation of test, [] and [[ ]].

Your Task

1. Modify the following if-statement structure to test if the number on the left is less-than the number on the right.

if [[ 3 INSERT-OPERATOR 4 ]]; then
    echo "3 is less than 4"
else
    echo "4 is not greater than 3"
fi

Tip: Numerical comparison operators to use with [[ ]] are -lt, -gt, -ge, -le, -eq, and -ne. They mean less-than, greater-than, greater-or-equal, and so on.
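For instance, a completed version of the comparison from step 1 could read:

if [[ 3 -lt 4 ]]; then
    echo "3 is less than 4"
else
    echo "4 is not greater than 3"
fi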

Now test if 3 is greater than 4 by using a different comparison operator.
2. Try the same command but with variables instead of numbers. Modify this code, remembering to set values for the variables num1 and num2.

num1=
num2=

if [[ $num1 INSERT-OPERATOR $num2 ]]; then
    INSERT-COMMANDS
else
    INSERT-COMMANDS
fi

Note: BASH only understands integers. Floating point arithmetic requires external programs (like bc).
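For example, a floating point comparison could be delegated to bc (assuming GNU bc is installed; it prints 1 when the expression is true and 0 otherwise):

if [[ $(echo "3.5 < 4.2" | bc) -eq 1 ]]; then
    echo "3.5 is less than 4.2"
fi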

3. Now we will perform string comparisons. The main purpose of this is to see if some variable is set to a certain value. Strings use different comparison operators than integers. For strings we use ==, >, <, and !=. By far the most common operators are == and != meaning respectively equal and not equal.


string=

if [[ $string == "A String" ]]; then
    echo "strings the same"
else
    echo "strings are not the same"
fi

Note: This is one place where the difference between [[ ]] and [ ] becomes evident. With [ ] you have to escape the < and > characters because they are special characters to the shell. With [[ ]] you don't have to worry about escaping anything. Recall that in BASH we use \ to tell BASH to process the next character literally.

Note: If a string has a space in it, the space has to be escaped somehow. One way of doing this is by using either single or double quotes.

Task 5: Put if and for together

Background

We will now return to our script with the for-loop and extend its functionality by adding an if-statement inside of the for-loop. In this task, we will find how long the script that generated each logfile ran. We will print the run time and the logfile name to the screen if the runtime is below 9 hours. I've broken this rather large task into small steps. Raise your hand if you get lost! This one's hard.

Your Task

1. In each logfile the “run-time” is recorded. It is the amount of time the freesurfer script which generated the logfile ran. Open your script and modify the grep command to search for the “run-time” instead of the subject ID. You’ll need to remove the -o flag now because we’ll need the full line.

# an example
for file in list; do
    grep SEARCH-PATTERN $file
done

After correctly modifying grep and running the script, you should have a bunch of lines output to the screen. They’ll all be of the form:

#@#%# recon-all-run-time-hours 5.525
#@#%# recon-all-run-time-hours 10.225
...

If you get output like this, move on to 2.


2. Restrict this output to ONLY numbers less than 10. In other words, find a search pattern that is only sensitive to one digit followed by a decimal point. Then find a way to restrict the output further so that only the whole number remains, i.e. 8.45 becomes simply 8. If you spend more than 10 minutes on this, look at the solution and move on to 3! This is a hard one, so I provide lots of hints.

Tip:
1. You only need grep for this, not if. Think about piping multiple grep commands together and using regexes.
2. The key to this question is getting the right regexp. There are a few ways you could do this.
3. Remember that "space" is a character.
4. If you want to search for a literal . character, you'll have to escape it with grep, i.e. \. and not ..
5. Be careful not to accidentally return only the second digit of a two digit number.
6. In grep you don't negate the items inside [] with ! as you do with wildcards; instead you use ^, i.e. [^0-9] to mean NOT a number from 0 to 9, instead of [!0-9].
7. Finally, it's good practice in grep to put your search term in single or double quotes.

3. grep should be returning one digit numbers or nothing at all. This is what we want! In step 3, we will capture the output and save it to a variable. We will use this variable later for a numerical comparison involving if. Recall command substitution. If you want to save the output of a command as a variable, use the syntax:

var=$(MY-COMMANDS-HERE)

Insert your command into the parentheses and then insert that line in place of your current grep pipeline. 4. Now add an if-statement to the body of the for-loop and create a comparison, testing if the value grep returned is less than 9. If the value is less than 9, we want to print the name of the logfile and the variable value to the screen.

for file in list; do
    var=$(MY-GREP-PIPELINE)
    if [[ $var INSERT-OPERATOR INSERT-VALUE ]]; then
        DO SOMETHING
    fi
done

If you’ve done this correctly, you may notice an odd result. Even if $var is empty, your comparison will always evaluate to less than 9?! If this odd outcome is the same as yours, check the solution and then move onto subtask 5!

Tip: An excellent trick is to echo the commands you will run before you run them. If, for example, you are (as you should be) worried that your search patterns are a bit too liberal, you can see what the loop will actually do by putting it in double-quotes and adding echo before it. Observe:

for file in list; do
    var=$(MY-GREP-PIPELINE)
    echo "if [[ $var INSERT-OPERATOR INSERT-VALUE ]]; then
    DO SOMETHING
    fi"
done

Instead of running the commands, you’ve now told the for-loop to echo what will actually be run to the screen. This is an important step in checking your own code for errors before you run it.

5. The reason $var is always less than 9, even when nothing is assigned to it, is that empty strings evaluate to 0! To get around this you can add extra conditions to your if-statement. Add an extra comparison that will test if $var is greater than zero. The syntax is like so:

for file in list; do
    var=$(MY-GREP-PIPELINE)
    if [[ $var INSERT-OPERATOR INSERT-VALUE && $var INSERT-OPERATOR INSERT-VALUE ]]; then
        DO SOMETHING
    fi
done

This will test if both conditions evaluate to true, and then run the command if both are true. You could also create a comparison using logical or with ||. As a result, if the run time is less than 9 hours and greater than 0 hours, we will print the log and the run time to the screen. Good work!

Note: For an even better solution, you can use what are called unary operators. These are detailed among the agonies of this fairly exhaustive treatment. They test if variables are empty strings, if files exist, etc. Note that this guide uses the [ ] form of test, but you can use everything described there with the [[ ]] form as well.
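For example, a sketch of the check using the -n unary operator (true only for a non-empty string) instead of a second numerical comparison:

for file in *log; do
    var=$(grep "run-time" $file | grep -o " [0-9]\." | grep -o "[0-9]")
    # -n is true only when $var is a non-empty string
    if [[ -n $var && $var -lt 9 ]]; then
        echo "$file: $var"
    fi
done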

Solutions

Task 5 question 2

$ grep "run-time" $file | grep -o " [0-9]\." | grep -o "[0-9]"

Task 5 question 4

for file in *log; do
    var=$(grep "run-time" $file | grep -o " [0-9]\." | grep -o "[0-9]")
    if [[ $var -lt 9 ]]; then
        echo "$file: $var"
    fi
done

Task 5 question 5


for file in *log; do
    var=$(grep "run-time" $file | grep -o " [0-9]\." | grep -o "[0-9]")
    if [[ $var -lt 9 && $var -gt 0 ]]; then
        echo "$file: $var"
    fi
done

2.4 The HPC cluster

2.4.1 Obtaining a user account

You should receive a username/password pair after following the ICT check-in at DCCN. If you do not have an account, ask the TG helpdesk.

Note: The user account here is NOT the account (e.g. u-number) given by the Radboud University.

2.4.2 Accessing the cluster

Getting access to the HPC cluster

SSH login with Putty

Follow the steps below to connect to one of the cluster's access nodes using SSH. Screenshots of the four steps are shown below:
1. start putty on the Windows desktop
2. configure putty for connecting to, e.g., mentat001.dccn.nl
3. login with your username and password
4. get a text-based virtual terminal with a shell prompt

SSH logout

You can log out of the system by either closing the Putty window or typing the command exit in the virtual terminal.

VNC for graphic desktop

Note: For the first-time user, type

$ vncpasswd

in the putty terminal to protect your VNC server from anonymous access before following the instructions below.

Firstly, start the VNC server by typing the following command in the putty terminal.


$ vncmanager

Follow the step-by-step instructions on the screen to initiate a VNC server. See the screenshots below as an example.
1. start a new VNC server
2. select a host
3. choose resolution
4. make fullscreen
5. select windows manager
6. VNC server started with a display endpoint
In the screenshots above, we have started a VNC server associated with a display endpoint mentat002.dccn.nl:56. To connect to it, we use a VNC client called TigerVNC Viewer. Follow the steps below to make the connection:

Note: The display endpoint mentat002.dccn.nl:56 is just an example. In reality, you should replace it with a different endpoint given by the vncmanager.

1. open the TigerVNC Viewer (double-click the icon on the desktop)
2. enter the display endpoint (mentat002.dccn.nl:56) as the VNC server
3. enter the authentication password you set via the vncpasswd command
4. get the graphical desktop of the access node


Disconnect VNC server

To disconnect the VNC server, simply close the TigerVNC-viewer window in which the graphical desktop is displayed. The VNC server will remain available, and can be reused (re-connected) when you need to use the graphical desktop again in the future.

Warning: DO NOT log out of the graphical desktop, as this causes the VNC server to become inaccessible afterwards.

Terminate VNC server

Since the graphical windows manager takes a significant amount of resources from the system, it is strongly recommended to terminate the VNC server when you are not actively using it. Terminating a VNC server can be done via the vncmanager command. The steps are shown in the screenshots below:
1. stop a VNC server
2. choose a server to be stopped
3. confirm and stop the server

Access from outside of DCCN

If you are at home or travelling, or connecting your personal laptop to the eduroam network, you are not allowed to connect to the access nodes directly as they are in the DCCN network protected by a firewall. In this case, you need to make the connection indirectly via one of the following two ways:


Using eduVPN

EduVPN is a virtual private network service provided by SURF allowing secure access to the institute's protected network, services and systems. It is the most straightforward way of accessing the HPC cluster from outside of the DCCN network; but it requires a valid RU/RUMC account prefixed with u or e (a.k.a. the u/e-number). If you have such an RU/RUMC account, you can follow the instruction to set up eduVPN. After you start the eduVPN connection, your computer is "virtually" part of the DCCN network. With that, you can connect directly to the HPC cluster as if you were accessing it from inside DCCN.

Using SSH tunnel

An SSH gateway named ssh.dccn.nl is provided for setting up SSH tunnels. When setting up a tunnel for connecting to a target service behind the firewall, one needs to choose a local network port that is still free for use on your desktop/laptop (i.e. the Source port) and provide the network endpoint (i.e. the Destination) referring to the target service.
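For example, a command-line sketch of such a tunnel, forwarding the local port 8022 to the SSH service on mentat001 (replace xxxyyy with your own DCCN username):

$ ssh -L 8022:mentat001:22 [email protected]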

Tip: This technique can also be applied for accessing different services protected by the DCCN firewall.

Contents

• Instructions in video
• Putty login via SSH tunnel
• VNC via SSH tunnel (Windows)
• VNC via SSH tunnel (Linux/Mac OSX)

Instructions in video

The following screencast will guide you through the steps of accessing the cluster via the SSH tunnel.


Putty login via SSH tunnel

In this example, we choose the Source port to be 8022. The Destination referring to the SSH server on mentat001 should be mentat001:22. Follow the steps below to establish the tunnel for the SSH connection:
1. start putty on the Windows desktop
2. configure putty for connecting to the SSH gateway ssh.fcdonders.nl
3. configure putty to initiate a local port 8022 for forwarding connections to mentat001:22
4. login to the gateway with your username and password to establish the tunnel
Once you have logged in to the gateway, you should keep the login window open and make another SSH connection to the local port as follows:
1. start another putty on the Windows desktop
2. configure putty for connecting to localhost on port 8022. This is the port we initiated when establishing the tunnel.
3. login with your username and password
4. get the virtual terminal with a shell prompt. You should see the hostname mentat001 showing on the prompt.

VNC via SSH tunnel (Windows)

In this example, we choose Source port to be 5956. We also assume that a VNC server has been started on mentat002 with the display number 56. The Destination referring to the VNC server should be mentat002:5956.

Note: The display number 56 is just an example. In reality, you should replace it with a different number assigned by the vncmanager. Nevertheless, the network port number is always the display number plus 5900.

Follow the steps below to establish the tunnel for the VNC connection:
1. start putty on the Windows desktop
2. configure putty for connecting to the SSH gateway ssh.fcdonders.nl
3. configure putty to initiate a local port 5956 for forwarding connections to mentat002:5956
4. login to the gateway with your username and password to establish the tunnel
Once you have logged in to the gateway, you should keep the login window open and make a VNC client connection to the local port as follows:
1. open the TigerVNC application
2. enter the display endpoint (localhost:5956) as the VNC server
3. enter the authentication password you set via the vncpasswd command
4. get the graphical desktop of the access node


VNC via SSH tunnel (Linux/Mac OSX)

In this example, we choose Source port to be 5956. We also assume that a VNC server has been started on mentat002 with the display number 56. The Destination referring to the VNC server should be mentat002:5956.

Note: The display number 56 is just an example. In reality, you should replace it with a different number assigned by the vncmanager. Nevertheless, the network port number is always the display number plus 5900.

Follow the steps below to establish the tunnel for the VNC connection:
1. open a terminal application
On Linux, this can be gnome-terminal on the GNOME desktop environment, xfce4-terminal on XFCE4, or konsole on KDE. On Mac, the Terminal app can be found in the Other group under the app launchpad.
2. set up the SSH tunnel
Use the following command to create the SSH tunnel. Note that the $ sign is just an indication of your terminal prompt; it is not part of the command. The username xxxyyy should be replaced with your actual DCCN account name.

$ ssh -L 5956:mentat002:5956 [email protected]

A screenshot below shows an example. Once the connection is set up, you should leave the terminal open. If you close the terminal, the tunnel is also closed. You can now make a connection to your VNC session through this SSH tunnel.
3. open the TigerVNC application


4. enter the display endpoint (localhost:5956) as the VNC server
5. enter the authentication password you set via the vncpasswd command
6. get the graphical desktop of the access node

2.4.3 Using the cluster

Running computations on the Torque cluster

What is the Torque cluster?

The Torque cluster is a pool of high-end computers (also referred to as compute nodes) managed by a resource manager called Torque and a job scheduler called Moab. Instead of allowing users to log in to one computer and run computations freely, users submit their computations in the form of jobs to the Torque cluster. A sketch in the picture below summarises how jobs are managed by the Torque server and scheduled by its companion, the Moab server, to perform computations on the compute nodes in the cluster. Every job is submitted to the Torque cluster with a set of resource requirements (e.g. duration of the computation, number of CPU cores, amount of RAM, etc.). Based on the requirements, jobs are arranged internally in job queues. The Moab scheduler is responsible for prioritising jobs and assigning them to compute nodes on which the jobs' requirements are fulfilled. The system also guarantees dedicated resources for the computation. Thus, interference between different computations is minimised, resulting in more predictable job completion times.


Fig. 2: Figure: a simplified view of the torque cluster architecture.


Resource sharing and job prioritisation

For optimising the utilisation of the resources of the Torque cluster, certain resource-sharing and job prioritisation policies are applied to jobs submitted to the cluster. The implications for users can be seen from three aspects: job queues, throttling policies for resource usage and job prioritisation.

Job queues

In the cluster, several job queues are made available in order to arrange jobs by resource requirements. Those queues are summarised in the table below. Queues are mainly distinguished by their wall time and memory limitations. Some queues, such as matlab, vgl and interactive, have their own special purpose for jobs with additional resource requirements.

queue name    routing queue   max. walltime per job   max. memory per job   special feature        job priority
matlab        N/A             48 hours                256 GB                matlab license         normal
vgl           N/A             8 hours                 10 GB                 VirtualGL capability   normal
bigscratch    N/A             72 hours                256 GB                local disk space       normal
short         N/A             2 hours                 8 GB                                         normal
veryshort     N/A             20 minutes              8 GB                                         normal
long          automatic       24 hours                8 GB                                         normal
batch         automatic       48 hours                256 GB                                       normal
verylong      automatic       72 hours                64 GB                                        normal
interactive   automatic       72 hours                64 GB                 user interaction       high
lcmgui        N/A             72 hours                64 GB                 interactive LCModel    high

At job submission time, the user can specify to which queue the job should be placed in the system. Alternatively, one could simply specify the wall time and memory required by the job and let the system automatically pick the most appropriate queue for the job. The second approach is implemented by the automatic queue, which behaves as a router to a destination queue.

Throttling policies for resource usage

In the Torque cluster at DCCN, throttling policies are applied to limit the amount of resources a user can allocate at the same time. This avoids the resources of the entire cluster being occupied by a single user. The policies are defined in two scopes:

Queue-wise policies

For every job queue, the total number of runnable and queue-able jobs per user is throttled. In the table below, the max. runnable jobs specifies the maximum number of running jobs a user is allowed to have in a queue at a given time, while the max. queue-able jobs restricts the total number of jobs (including idle, running and blocked jobs) a user is allowed to have.


queue name    max. runnable jobs   max. queue-able jobs
matlab        400                  2000
bigscratch    400                  2000
short         400                  2000
veryshort     400                  2000
long          400                  2000
batch         400                  2000
verylong      400                  2000
vgl           2                    5
interactive   2                    4
lcmgui        2                    4

For most of the queues, the numbers of runnable and queue-able jobs are set to 400 and 2000, respectively. However, more restrictive policies are applied to jobs in the vgl, interactive and lcmgui queues. For jobs in the vgl queue, the maximum runnable and queue-able jobs are set to 2 and 5, respectively; while they are 2 and 4 for jobs in the interactive and the lcmgui queues. This compensates for the fact that vgl jobs consume a lot of network bandwidth, and that interactive and lcmgui jobs always have the highest priority to start. Furthermore, the lcmgui jobs are always assigned to the node on which the LCModel license is installed.

Cluster-wise policies

The cluster-wise throttling limits the total amount of resources a single user can occupy at the same time in the cluster. The three upper-bound (cluster-wise) limitations are:
• 400 jobs
• 660 days processing (wall)time
• 1 TB memory
The cluster-wise policies overrule the queue-wise policies. This implies that if the resource utilisation of your currently running jobs reaches one of the cluster-wise limitations, your additional jobs have to wait in the queue, even if there are still available resources in the cluster and you have not reached the queue-wise limitations.

Job prioritisation

Job priority determines the order in which waiting jobs start in the cluster. Job priority is calculated by the Moab scheduler taking into account various factors. In the cluster at DCCN, mainly the following two factors are considered.
1. The waiting time a job has spent in the queue: this factor adds one additional priority point for each additional minute the job waits in the queue.
2. Queue priority: this factor is mainly used for boosting jobs in the interactive queue with an outstanding priority offset so that they are started sooner than other types of jobs.
The final job priority combining the two factors is used by the scheduler to order the waiting jobs accordingly. The first job on the ordered list is the next to start in the cluster.
Note: Job priority calculation is dynamic and not completely transparent to users. One should keep in mind that the cluster does not treat the jobs as "first-come first-serve".


Job management workflow

The Torque system comes with a set of command-line tools for users to manage jobs in the cluster. These tools are generic and can be utilised for running various types of analysis jobs. The picture on the left shows a general job management lifecycle when running your computations in the cluster. The three most used tools during the job management lifecycle are: qsub for submitting jobs to the cluster, qstat for checking jobs' status in the cluster, and qdel for cancelling jobs. Their usage is given below.

Batch job submission

The qsub command is used to submit jobs to the Torque job manager. The first and simplest way of using qsub is pipelining a command-line string to it. Assuming that we want to display the hostname of the compute node on which the job will run, we issue the following command:

$ echo '/bin/hostname -f' | qsub -l 'nodes=1:ppn=1,mem=128mb,walltime=00:10:00'

Here we echo the command we want to run (i.e. /bin/hostname -f) as a string, and pass it to qsub as the content of our job. In addition, we also request resources of 1 processor with 128 megabytes of RAM for a walltime of 10 minutes, using the -l option. In return, you will receive a unique job identifier similar to the one below.

6278224.dccn-l029.dccn.nl

It is "the" identifier to be used for tracing the job's progress and status in the cluster. We will show this later; for the moment, we continue with a different way of using the qsub command. It is more realistic that our computation involves a set of commands to be executed sequentially. A handier way is to compose those commands into a BASH script and hand the script over to the qsub command. Assuming we have made a script called my_analysis.sh right in the present working directory (i.e. PWD), we can then submit this script as a job via the following command:

$ qsub -l 'nodes=1:ppn=1,mem=128mb,walltime=00:10:00' ${PWD}/my_analysis.sh
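For reference, a minimal sketch of what such a my_analysis.sh could contain (the module and the analysis steps are placeholders, not part of the original example):

#!/bin/bash
# my_analysis.sh: a hypothetical minimal job script
echo "Job running on $(hostname -f)"
# load any software modules the analysis needs, e.g.:
module load fsl/5.0.6
# ... the actual analysis commands would follow here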

It happens very often that the same analysis needs to be repeated on many datasets, each corresponding to, for example, a subject. It would be smart to implement the bash script with additional arguments to switch between datasets. Assuming that my_analysis.sh is now implemented to take one argument as the subject index, submitting the script to run on the dataset of subject 001 would look like the example below:

$ echo "${PWD}/my_analysis.sh 001" | qsub -N 's001' -l 'nodes=1:ppn=1,mem=128mb,walltime=00:10:00'

Note: The command above for passing an argument to the script is actually a workaround, as qsub (of the currently installed version) does not provide options to deal with command arguments.
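Combining this with the for-loop from the BASH session, one could, for example, submit the same script for several subjects in one go (a sketch, assuming subject indices 001 to 003):

for subj in 001 002 003; do
    echo "${PWD}/my_analysis.sh ${subj}" | qsub -N "s${subj}" -l 'nodes=1:ppn=1,mem=128mb,walltime=00:10:00'
done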

Interactive computation in text mode

It is possible to acquire a Linux shell on a compute node for running computations interactively. This is done by submitting so-called interactive jobs. To submit an interactive job, one adds the -I option to the qsub command:


Fig. 3: Figure: the generic job management workflow.


$ qsub -I -l 'nodes=1:ppn=1,mem=128mb,walltime=00:10:00'

In a few seconds, a message similar to the one below will show up in the terminal.

1 qsub: waiting for job 6318221.dccn-l029.dccn.nl to start

2 qsub: job 6318221.dccn-l029.dccn.nl ready

3

4 ------

5 Begin PBS Prologue Tue Aug 5 13:31:05 CEST 2014 1407238265

6 Job ID: 6318221.dccn-l029.dccn.nl

7 Username: honlee

8 Group: tg

9 Asked resources: nodes=1:ppn=1,mem=128mb,walltime=00:10:00,neednodes=1:ppn=1

10 Queue: interactive

11 Nodes: dccn-c351

12 End PBS Prologue Tue Aug 5 13:31:05 CEST 2014 1407238265

13 ------

14 honlee@dccn-c351:~

The shell prompt on line 14 shows that you are now logged into a compute node (i.e. dccn-c351). You can now run the computation interactively by typing a command after the prompt.
Note: the resource usage of an interactive job is also monitored by the Torque system. The job will be killed (i.e. you will be kicked out of the shell) when the computation uses more than the amount of resources requested at job submission time.

Interactive computation in graphic mode

Interactive computation in graphic mode is actually achieved by submitting a batch job that runs the graphical application on the compute node; when the application runs, it shows its graphical interface remotely on the cluster's access node. Therefore, it requires you to connect to the cluster's access node via VNC. Assuming we want to run FSL interactively through its graphical menu, we use the following commands:

$ xhost +
$ echo "export DISPLAY=${HOSTNAME}${DISPLAY}; fsl" | qsub -q interactive -l 'nodes=1:ppn=1,mem=128mb,walltime=00:10:00'

The first command allows graphical interfaces on any remote host to be displayed on the access node. The second command submits a job that first sets the compute node to forward graphical interfaces to the access node, and then launches the FSL executable.

Checking job status

Every submitted job in the cluster is referred to by a unique identifier (i.e. the job id). It is "the" reference allowing the system and users to trace the progress of a particular job in the cluster. The system also maintains a set of historical jobs (i.e. jobs finished in the last 12 hours) that can also be queried by users using the qstat command. To get a list of jobs submitted by you, simply run

$ qstat

If you have jobs in the system, you will get a table similar to the one below:


job id             Name     User     Time Use  S  Queue
-----------------  -------  -------  --------  -  -----------
6318626.dccn-l029  matlab   honlee   00:00:00  C  matlab
6318627.dccn-l029  matlab   honlee   00:00:00  C  matlab
6318630.dccn-l029  STDIN    honlee   00:00:01  C  matlab
6318631.dccn-l029  STDIN    honlee   00:00:01  C  interactive

In the table, the column Time Use indicates the CPU time utilisation of the job, while the job status is presented in the column S with a capital-letter flag. Possible job-status flags are summarised below:
• H: job is held (by the system or the user)
• Q: job is queued and eligible to run
• R: job is running
• E: job is exiting after having run
• C: job is completed after having run

Tip: There are many options supported by qstat. For example, one can use -i to list only jobs waiting in the queue. More options can be found via the online document using man qstat.

Cancelling jobs

Cancelling jobs in the cluster is done with the qdel command. For example, to cancel a job with id 6318635, one does

$ qdel 6318635

Note: You cannot cancel jobs in status exiting (E) or completed (C).

Output streams of the job

On the compute node, the job itself is executed as a process in the system. The default STDOUT and STDERR streams of the process are redirected to files whose names end in .o<job number> and .e<job number>, respectively. After the job reaches the complete state, these two files will be produced on the file system.

Tip: The STDOUT and STDERR files produced by a job usually provide useful information for debugging issues with the job. Always check them first when your job fails or is terminated unexpectedly.
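For example, for a hypothetical batch job submitted from the script my_analysis.sh and given job number 6318221 (both names are just illustrations), the two files could be inspected with:

$ cat my_analysis.sh.o6318221
$ cat my_analysis.sh.e6318221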

Specifying resource requirement

Each job submitted to the cluster comes with a resource requirement. The job scheduler and resource manager of the cluster make sure that the needed resources are allocated for the job. To allow the job to complete successfully, it is important that the right and sufficient amount of resources is specified at job submission time.


When submitting jobs with the qsub command, one uses the -l option to specify the required resources. The value of the -l option follows a certain syntax. Details of the syntax can be found in the Torque documentation. Below are a few useful and commonly used examples for jobs requiring:

Warning: The examples below only show the option of the qsub command for resource specification (-l); therefore they are NOT complete commands. You need to make the command complete by adding either a -I option for an interactive job or passing a script to be run as a batch job.

1 CPU core, 4 gigabytes memory and 12 hours wallclock time

$ qsub -l 'walltime=12:00:00,mem=4gb' job.sh

The requirement of 1 CPU is skipped as it defaults to 1.

4 CPU cores on a single node, 12 hours wallclock time, and 4 gb memory

$ qsub -l 'nodes=1:ppn=4,walltime=12:00:00,mem=4gb' job.sh

Here we explicitly ask for the 4 CPU cores to be on the same compute node. This is usually the case when the application (such as MATLAB with multithreading) can benefit from multiple cores on a single (SMP) node to speed up the computation.

1 CPU core, 500gb of free local “scratch” diskspace in /data, 12 hours wallclock time, and 4 gb memory

$ qsub -l 'file=500gb,walltime=12:00:00,mem=4gb' job.sh

Here we explicitly ask for 500gb of free local diskspace located in /data on the compute node. This could, for instance, be requested when submitting an fmriprep job that needs a lot of local diskspace for computation. The more jobs are running, the longer it can take for Torque to find a node with enough free diskspace to run the job. The maximum you can request is 3600gb.

Note: In case you use more than the requested 500gb there will be no penalty. Diskspace is monitored, but your job won’t fail if the requested diskspace is “overused”, as long as diskspace is available. Of course if no more diskspace is available your job will fail.

1 Intel CPU core, 4 gigabytes memory and 12 hours wallclock time, on a node with 10 Gb network connectivity

$ qsub -l 'nodes=1:intel:network10GigE,walltime=12:00:00,mem=4gb' job.sh

Here we ask the allocated CPU core to be on a node with properties intel and network10GigE.


1 AMD EPYC7351 CPU core, 4 gigabytes memory and 12 hours wallclock time

$ qsub -l 'nodes=1:amd:epyc7351,walltime=12:00:00,mem=4gb' job.sh

Here we ask the allocated CPU core to be on a node with properties amd (CPU vendor) and epyc7351 (CPU model).

4 CPU cores, 12 hours wallclock time, and 4 gb memory. The 4 CPU cores may come from different nodes

$ qsub -l 'procs=4,walltime=12:00:00,mem=4gb' job.sh

Here we use procs to specify the number of CPU cores we need, without restricting them to a single node. In this scenario, the job (or the application the job runs) should take care of the communication between the processors distributed over many nodes. This is typical for MPI-like applications.

1 GPU with minimal cuda capability 5.0, 12 hours wallclock time, and 4 gb memory

$ qsub -l 'nodes=1:gpus=1,feature=cuda,walltime=1:00:00,mem=4gb,reqattr=cudacap>=5.0'

Here we ask for 1 GPU on a node with the (dynamic) attribute cudacap set to larger than or equal to 5.0. The feature=cuda requirement allows the system to make use of a standing reservation if there is still space available in the reservation.

Note: Currently there are 9 Nvidia Tesla P100 GPUs available in the entire cluster. More GPUs will be added to the cluster in the future.

Estimating resource requirement

As we have mentioned, every job has attributes specifying the resources required for its computation. Based on those attributes, the job scheduler allocates resources for jobs. The more precisely these requirement attributes are given, the more efficiently the resources are used. Therefore, we encourage all users to estimate the resource requirements before submitting massive numbers of jobs to the cluster. The walltime and memory requirements are the most essential ones. Below are three different ways to estimate those two requirements.

Note: Computing resources in the cluster are reserved for jobs in terms of size (e.g. the amount of requested memory and CPU cores) and duration (e.g. the requested walltime). Under-estimating the requirement causes the job to be killed before completion, wasting the resources the job has already consumed; over-estimating blocks resources from being used efficiently.

1. Consult your colleagues
If your analysis tool (or script) is commonly used in your research field, consulting your colleagues might just be an efficient way to get a general idea about the resource requirement of the tool.
2. Monitor the resource consumption (with an interactive test job)
A good way of estimating the wall time and memory requirement is to monitor their usage at run time. This approach is only feasible if you run the job interactively through a graphical interface. Nevertheless,


it’s encouraged to test your data analysis computation interactively once before submitting it to the cluster with a large amount of batch jobs. Through the interactive test, one could easily debug issues and measure the resource usage. Upon the start of an interactive job, a resource comsumption monitor is shown on the top-right corner of your VNC desktop. An example is shown in the following screenshot:

The resource monitor consists of three bars. From top to bottom, they are:
• Elapsed walltime: the bar indicates the elapsed walltime consumed by the job. It also shows the remaining walltime. The walltime is adjusted according to the CPU speed.
• Memory usage: the bar indicates the current memory usage of the job.
• Max memory usage: the bar indicates the peak memory usage of the job.
3. Use the job's epilogue message (a trial-and-error approach)
The wall time and memory requirements can also be determined with a trial procedure in which the user submits a test job to the cluster with a rough requirement. In the job's STDOUT file (i.e. the .o file), you will see an Epilogue message stating the amount of resources used by the job. In the snippet below, this is shown on line 10. Please also note the job exit code 137 on line 4. It indicates that the job was killed by the system, very likely due to memory overuse, if you see that the memory usage reported on line 10 is close to the memory requirement on line 9.

1 ------

2 Begin PBS Epilogue Wed Oct 17 10:18:53 CEST 2018 1539764333

3 Job ID: 17635280.dccn-l029.dccn.nl

4 Job Exit Code: 137

5 Username: honlee

6 Group: tg

7 Job Name: fake_app_2

8 Session: 15668

9 Asked resources: walltime=00:10:00,mem=128mb

10 Used resources: cput=00:00:04,walltime=00:00:19,mem=134217728b

11 Queue: veryshort

12 Nodes: dccn-c365.dccn.nl

13 End PBS Epilogue Wed Oct 17 10:18:53 CEST 2018 1539764333

14 ------

Note: In addition to checking the job's epilogue message, you will also receive an email notification when the job exceeds the requested walltime.

Adjust the rough requirement gradually based on the usage information and resubmit the test job with the new requirement. In a few iterations, you will be able to determine the actual usage of your analysis job. A rule of thumb for specifying the resource requirement for the production jobs is to add a 10~20% buffer on top of the actual usage as a safety margin.

Cluster tools

A set of auxiliary scripts has been developed to ease job management on the cluster. Those tools are listed below with a brief description of their functionality. To use them, simply type the command in the terminal. You can apply the -h or --help option to check if there are more options available.

command    function
checkjob   shows job status from the scheduler's perspective. It is useful for knowing why a job is not started.
pbsnodes   lists the compute nodes in the cluster. It is one of the Torque client tools.
hpcutil    retrieves various information about the cluster and jobs. See hpcutil-usage for more detail about the usage.

Using supported software

Commonly used data analysis/processing software is centrally managed and supported in the cluster. A list of the supported software can be found here. The repository where the software is organised and installed is mounted to the /opt directory on every cluster node.

Tip: You are welcome to take the initiative to introduce new software into the repository, with the awareness of the maintainer's responsibility. See HPC software maintainer for more detail.

Using the supported software via modules

Running software or an application in Linux requires certain changes to environment variables. Some variables are common (such as $PATH, $LD_LIBRARY_PATH), some are application specific (such as $R_LIBS for R, $SUBJECTS_DIR for Freesurfer). In order to help configure the shell environment for running the supported software, a tool called Environment Modules is used in the cluster. Hereafter, we introduce a few of the most used module commands for using the supported software in the cluster.


Note: You should have the module command available if you log in to one of the mentat access nodes using an SSH client (e.g. putty). In the virtual terminal (i.e. GNOME Terminal or Konsole) of a VNC session, the module command may not be available immediately. If this happens to you, make sure the following lines are present in the ~/.bashrc file.

if [ -f /etc/bashrc ]; then
    source /etc/bashrc
fi

For example, run the following command in a terminal:

$ echo 'if [ -f /etc/bashrc ]; then source /etc/bashrc; fi' >> ~/.bashrc

Please note that you should close all existing terminals in the VNC session and start from a new terminal. In the new terminal, you should have the module command available.

Showing available software

Firstly, one uses the module command to list the supported software in the cluster. This is done by the following command:

$ module avail
-------------------------------- /opt/_modules --------------------------------
32bit/brainvoyagerqx/1.10.4  cluster/1.0(default)  matlab/7.0.4   mcr/R2011b
32bit/brainvoyagerqx/1.3.8   cuda/5.0              matlab/7.1.0   mcr/R2012a
32bit/brainvoyagerqx/1.8.6   cuda/5.5(default)     matlab/R2006a  mcr/R2012b(default)
32bit/ctf/4.16               dcmtk/3.6.0(default)  matlab/R2006b  mcr/R2013a
32bit/mricro/1.38_6          fsl/5.0.6             matlab/R2014a  python/2.6.5

## ... skip ...

As shown above, the software packages are represented as modules organised by name and version. From the list, one selects a software package (and version) by picking the corresponding module. Assuming that we are going to run FSL version 5.0.6, the module to choose is named fsl/5.0.6.
Tip: Software is installed in a directory following the hierarchy of the module names. For instance, the FSL software corresponding to the module fsl/5.0.6 is installed under the directory /opt/fsl/5.0.6.

Loading software

After choosing a module, the next step is to load it, which configures the shell environment accordingly for running the software. This is done via the load command. For example, to configure fsl/5.0.6 one does

$ module load fsl/5.0.6

After that, one can check if a right version of the FSL executable is available. For example,

$ which fsl /opt/fsl/5.0.6/bin/fsl


Tip: You can load more than one module at the same time.

Unloading software

When a loaded software is no longer needed, one can easily rollback the shell environment configuration by unloading the specific module. For instance,

$ module unload fsl/5.0.6

As the configuration for running FSL version 5.0.6 is removed, the FSL executable becomes unavailable. This makes sure that the environment is clean for running other software.

Listing loaded software

In most cases, you will have several software modules loaded in one shell environment. To get an overview of the modules loaded in the current shell, one can use the list option. For example,

% module list
Currently Loaded Modulefiles:
  1) fsl/5.0.6   2) R/3.1.2   3) cluster/1.0   4) matlab/R2012b

Pre-loaded software

Right after logging into the cluster, you will find several pre-loaded software modules. You can find them via the module list command. Although you are free to unload them using the module unload command, you should always keep the module cluster/1.0 loaded as it includes essential configurations for running computations in the cluster.

Tip: You should always keep the cluster/1.0 module loaded.

Using the supported software via utility scripts

For the most commonly used applications in the cluster (e.g. Matlab, R), utility scripts are provided that integrate with job submission to the Torque cluster. Those scripts are built on top of the software modules.

Available software

• Matlab
• RStudio
• Jupyter Notebook


Matlab

For running Matlab in the cluster, a set of wrapper scripts is available. They are part of the cluster module. With these wrapper scripts, one does not even need to load any corresponding modules in advance. To start, for example, Matlab version 2014b, simply run the following command.

% matlab2014b

The wrapper script internally uses the environment modules to configure the shell environment. It also decides the way of launching the Matlab program based on the function of the node on which the command is executed. For instance, if the command is executed on an access node, an interactive Torque job will be submitted to the cluster to start the Matlab program on one of the compute nodes.

RStudio

For running a graphical version of RStudio to do your R analysis, another set of wrapper scripts will submit the job to the HPC cluster. In this case no prerequisite steps have to be taken, as the wrapper scripts will do so for you. To start RStudio, just run the following command on the command line of your terminal in your VNC session.

% rstudio

The wrapper script starts a menu on which you can select your R/RStudio version combination. The latest versions are shown by default. Select your desired versions and click the OK button.

Next you will be asked for your job requirements for walltime and memory to submit RStudio as a graphical job to the HPC cluster (just like starting your interactive graphical Matlab session). Define your requirements and hit the OK button.

The menu will close and return you to your terminal. The terminal shows that the job has been submitted, along with its job ID.


You can check the status of your job with:

% qstat [jobID]

The selected combination of R/RStudio starts, along with the graphical walltime/memory indicator.

Jupyter Notebook

Jupyter Notebook provides a web-based Python environment for data analysis. To start it on the cluster, simply run the following command in the terminal within a VNC session.

% jupyter-notebook

For the moment, only the Jupyter Notebook from Anaconda 3 is supported as it provides token-based protection on the notebook.

Note: When using jupyter-notebook with a conda environment, you should also install the jupyter package when creating the environment so that your conda environment will be used within the notebook. For example,

% conda create --name env jupyter


Maintaining supported software

HPC software maintainer

In the HPC cluster, we organize commonly used software in a repository that is mounted under the /opt directory on all HPC nodes. While generic and widely used software such as MATLAB, Python, Anaconda, etc. is maintained by the TG, there is also software that requires domain-specific knowledge (e.g., Freesurfer, FSL, fmriprep) and that is therefore maintained by researchers.

Introducing new software

As a HPC user, you may request new software to be installed, especially if you see a potential benefit to the wider group of users of the HPC cluster. Please send your request and initiate a discussion with the TG by sending a helpdesk ticket. If the request is approved, a maintainer needs to be identified. This can be the TG, if the software is a generic tool/framework and the TG feels capable of maintaining it in the longer term, or it can be a researcher (usually the person who makes the request). Once the maintainer is identified, the TG will create two directories under /opt and /opt/_modules for the software and its module files, respectively. Those directories will be owned by the maintainer's user account in the HPC cluster so that the maintainer can take responsibility and has full permission to modify their contents.

Maintainer responsibility

As the maintainer of a specific software package, you are responsible for the following tasks:

Note: Should there be issues while performing these tasks, the maintainer can always contact the TG and ask for help.

making a maintenance plan
    The maintainer should make a short plan for keeping the package up-to-date. Such a plan could, for instance, consist of checking a GitHub page for new releases or subscribing to a mailing list. The maintenance plan should be stored as part of the description in the common.tcl file of the software module.

installing new software versions
    The maintainer performs the installation of a new software version whenever there is a request from a user, or based on the upgrade plan. If the TG helpdesk receives a request for a new version, it is forwarded to the maintainer to follow it up. After the new version has been installed and tested successfully, the maintainer should set the newly installed version as the default, and communicate this to the users (via email or the Mattermost HPC channel).

maintaining the environment module for the software
    Following the installation of a software package, the maintainer should also provide a module file so that users can use the software by loading the module. Technical information on this can be found in How to maintain software modules in the HPC cluster. The TG can support you to overcome initial technical hurdles.

supporting users on software-specific issues


    When a user of a specific software package reports an issue, the TG helpdesk can be contacted as the first-line support. If the issue is identified to be software-specific (or specific knowledge from the maintainer is needed), the TG will forward the issue to the maintainer to help the user. As a maintainer, you may also write utility scripts to support users. It is recommended to store those scripts in the software directory under /opt, and to make sure the scripts do not depend on files outside /opt.

identifying a successor and handing over maintenance responsibility
    When the maintainer is about to leave the centre, the maintainer should identify a successor, or the software will become unmaintained. The responsibility for software maintenance does not automatically move over to the TG. Unmaintained software is left in the repository as it is, until there is a new request asking for an update of the software. When that happens, the TG will discuss with the user who makes the request whether they can take over the maintainer responsibility. Unmaintained software is a target for deletion when the storage space of /opt needs to be reclaimed.

How to maintain software modules in the HPC cluster

In the HPC cluster, commonly used software that is not part of the standard Linux system is installed in a centrally managed software repository. The repository is mounted on the access (mentat001 ~ mentat005) and compute nodes under the /opt directory. The user enables a specific software package using the module command, e.g., with module add R/3.4.0.
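For instance, a quick check that a module has been loaded might look like the sketch below; the exact installation path shown is an assumption based on the /opt layout described in the following sections.

$ module add R/3.4.0
$ which R
/opt/R/3.4.0/bin/R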

Installing software and organizing versions

Within /opt, each software package is organized in a separate sub-directory. The maintainer of a specific software package has owner permission on that sub-directory. Thus, before you can install software under the /opt directory, you need to be the maintainer of the software and take on the HPC software maintainer responsibility. The maintainer installs software and organizes versions. The example file system tree below shows how different versions of the software R are organized.

/opt/R
|-- 3.1.2
|-- 3.2.0
|-- 3.2.2
|-- 3.3.1
|-- 3.3.2
|-- 3.3.3
|-- 3.4.0
|-- 3.5.1
`-- 4.0.1

Tip: For software installed via GNU autoconf (i.e. the installation consists of the configure, make and make install steps), you can use the --prefix option in the configure step to specify the installation destination.
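As a sketch (using the hypothetical package name and version new_software 1.0.0, consistent with the Note further below), a source installation with GNU autoconf into the repository could look like:

$ tar xvzf new_software-1.0.0.tar.gz
$ cd new_software-1.0.0
$ ./configure --prefix=/opt/new_software/1.0.0
$ make
$ make install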

Making software available via environment modules

In the cluster, we use the environment modules for controlling the PATH and for (un-)setting software-specific environment variables. Next to the installation, the maintainer should therefore also create a script for the environment modules. Those module scripts are organized under the /opt/_modules directory.


Within /opt/_modules, each software package has its own sub-directory, within which module scripts referring to the different versions are arranged. The example below shows the file-system tree structure of the module scripts for the software R.

/opt/_modules/R
|-- .common
|-- .version
|-- 3.1.2 -> .common
|-- 3.2.0 -> .common
|-- 3.2.2 -> .common
|-- 3.3.1 -> .common
|-- 3.3.2 -> .common
|-- 3.3.3 -> .common
|-- 3.4.0 -> .common
|-- 3.5.1 -> .common
|-- 4.0.1 -> .common
`-- common.tcl

Module scripts are written in the TCL language. To lower the barrier of writing module scripts, the common (and complex) part has been factored out and is shared. The maintainer only needs to follow the steps below to write the package-specific part.

Note: The steps below assume that the first version (1.0.0) of a new software package is installed under /opt/new_software/1.0.0.

1. create a module for the new software

For newly introduced software, you can initiate the module scripts by, for example, copying the R module scripts. Given that the new software is installed in /opt/new_software/1.0.0, you would do:

$ cd /opt/_modules/new_software
$ cp /opt/_modules/R/common.tcl .
$ cp /opt/_modules/R/.version .
$ cp /opt/_modules/R/.common .

The three files we have copied are described below:

• common.tcl is a TCL script containing the software-specific configuration. It is the main file the maintainer should modify. It sets the environment variables needed for users to run the software; for example, it extends PATH so that the system shell can locate the software's executables. It also contains a few metadata entries describing the software.
• .version specifies the default version of the software. This is the version to be loaded if the user omits the version when loading the module. This file is optional; if it is not present, the version (within the software sub-directory, e.g. /opt/new_software) with the highest alphabetical value is used as the default.
• .common is the main module script that combines the common settings shared among all software modules with the software-specific settings defined in the common.tcl file. By design, the maintainer should not need to modify it.

2. modify common.tcl

The complete and official guide for writing module scripts is here. Hereafter is a very simple example of common.tcl for R:


#!/bin/env tclsh

set appname R
set appurl  "http://www.r-project.org/"
set appdesc "a free software programming language and software environment for statistical computing and graphics.

The package can be upgraded on user request or maximally within half a year after a new release. The default version is always set to the latest installed version.

You can ask questions/seek for previous answers in:
https://mattermost.socsci.ru.nl/dccn/channels/R

This package is maintained by [..]."

## require $version variable to be set
module-whatis [WhatIs]

## make sure only one R is loaded at a time
if { [ module-info mode load ] } {
    if { [ is-loaded R ] && ! [ is-loaded R/$version ] } {
        module unload R
    }
    if { [ string match "4*" $version ] } {
        module load gcc
    }
}

setenv       R_HOME  $env(DCCN_OPT_DIR)/R/$version/lib64/R
prepend-path PATH    "$env(DCCN_OPT_DIR)/R/$version/bin"
prepend-path MANPATH "$env(DCCN_OPT_DIR)/R/$version/share/man"

The first three set statements specify the three variables used for describing the software. They are also automatically displayed on the HPC software list page of the DCCN intranet:

1. appname is the name of the software.
2. appurl is the home (or a representative) page URL of the software.
3. appdesc is a short description of the application and should mention the upgrade and default-version policy. If possible, the user should be pointed to a support entry point. Note that only the first line of the description will be displayed on the HPC software list intranet page.

The last three lines set environment variables, so that when this module is loaded, the shell will:

1. acquire a new variable R_HOME with the value set to $env(DCCN_OPT_DIR)/R/$version/lib64/R,
2. prepend the path $env(DCCN_OPT_DIR)/R/$version/bin to the PATH variable, and
3. prepend the path $env(DCCN_OPT_DIR)/R/$version/share/man to the MANPATH variable.

In most cases, you will extend the PATH variable and add application-specific variables for the software, which can be achieved by using the prepend-path and setenv sub-commands of the environment modules. Note the two variables $env(DCCN_OPT_DIR) and $version used in this script. They are variables made available to the module file for referring to the top-level directory of the software repository (/opt in this case) and the version of the software the user is (un-)loading, respectively. In this example, the if statement resolves version conflicts by unloading any already loaded R version (if present) and loads a required module (i.e. gcc) when loading certain R versions. More logic can be implemented with the predefined sub-commands of the environment modules.

3. expose the module with a version

This step is to make a symbolic link to the .common file. The link name should reflect the software version. For instance, if the new software version is 1.0.0, one does:

$ ln -s .common 1.0.0

Once you have set up the module scripts, adding a module for a new version is usually as simple as making another symbolic link to the same .common file. For example, after installing version 2.0.0 of the software in the repository, you just do:

$ ln -s .common 2.0.0

4. set the default version

Setting the default version is done via the .version file. Hereafter is an example:

#%Module1.0#####################################################################
##
## version file for new_software
##
set ModulesVersion "1.0.0"

What you need to change is the value of ModulesVersion: set it to the name of one of the symlinks made in the previous step. You should keep the header line (i.e. the first line) unchanged. In the example above, when a user loads the module for new_software without specifying a version, version 1.0.0 will be loaded.

Tip: It is suggested to set the latest version as the default. Always communicate with users via email or the HPC Mattermost channel when the default version is changed.

Best practices of running jobs on the HPC cluster

In this section, we collect various best practices that may help speed up your data analysis. Please note that they were developed with certain use-cases in mind. Therefore, unless a practice is explicitly described as general, apply it carefully and always think twice whether it is applicable to your data analysis. If you have questions about the best practices below, or suggestions for new ones, please don't hesitate to contact the TG helpdesk.

Avoid massive short jobs

The scheduler in the HPC cluster favours fewer, longer jobs over many short jobs. The reason is that each job comes with extra overhead in terms of resource provisioning and job output staging. Therefore, if feasible, stacking many short jobs into one single longer job is encouraged.


With a longer job, your whole computation task will also finish sooner, because whenever a resource is allocated to you, you can utilise it longer and perform more computations. A trade-off of this approach is that, if a job fails, more computing time is wasted. This can be overcome with good bookkeeping, such that results from the finished computations in a job are preserved and the finished computations do not need to be re-run.
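A minimal sketch of such bookkeeping within one longer job is shown below; the script analyze_one.sh, the results directory and the number of tasks are hypothetical placeholders.

#!/bin/bash
# run many short tasks inside a single job; skip tasks whose result already exists
mkdir -p results
for id in $( seq 1 100 ); do
    result="results/task_${id}.out"
    if [ -f "${result}" ]; then
        echo "skipping task ${id}: result already present"
        continue
    fi
    ./analyze_one.sh ${id} > "${result}"
done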

Utilise the scratch drive on the compute node

If your compute jobs on the cluster produce intermediate data during processing, using the scratch drive locally on the compute node has two benefits:

• Data I/O on the local drive is faster than on the home and project directories, which are provided by a network-attached storage.
• It saves storage space in your home or project directory.

The scratch drive on the compute node is mounted on the path /data. A general approach to storing data on it is to create a subdirectory under the /data path and make its name specific to your job. For example, you could introduce a new environment variable in the BASH shell called LOCAL_SCRATCH_DIR in the following way:

export LOCAL_SCRATCH_DIR=/data/${USER}/${PBS_JOBID}/$$
mkdir -p ${LOCAL_SCRATCH_DIR}

Whenever you want to store intermediate data in the directory, use the absolute path with the prefix ${LOCAL_SCRATCH_DIR}. For example,

cp /home/tg/honlee/mydataset.txt ${LOCAL_SCRATCH_DIR}/mydataset.txt

It would be nice if your job also takes care of cleaning up the data in the /data directory. For example,

rm -rf ${LOCAL_SCRATCH_DIR}

Generally speaking, it’s not really necessary as data in this directory will be automatically removed after 14 days. However, it may help other users (and yourself) to utilise the local scratch for large datasets if space is not occupied by finished jobs.

Avoid massive output to STDOUT

It may be handy (and quick) to just print analysis results to the screen (in other words, to the standard output). However, if the output is lengthy, it can result in very large STDOUT files produced by your compute jobs. Multiplied by the number of parallel jobs you submit to the system, this can end up filling up your home directory. Things can easily go wrong when your home directory is full (i.e. out of quota), such as data loss. A good advice is to write your analysis output to a file with a good data structure. Most analysis tools provide their own data structures, e.g. the .mat file of MATLAB or the .RData file of R.
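For example, when submitting a job that wraps a verbose (hypothetical) analysis command, its screen output can be redirected to a log file on the project storage instead of ending up in the job's STDOUT file in your home directory; the command name and log path below are assumptions for illustration only.

$ echo "cd $PWD; ./my_analysis > /project/3010000.01/logs/my_analysis.log 2>&1" | qsub -l walltime=01:00:00,mem=1gb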

2.4.4 Application software

Exercise: Using the environment modules to setup data-analysis software

In this exercise we will learn a few useful commands for setting up data-analysis software in the cluster using the environment modules. Environment modules are helpful in organising software and managing the environment variables required for running the software.


The tasks below use the software R to illustrate the general idea, which is applicable to setting up other data-analysis software installed in the cluster.

Note: DO NOT just copy-n-paste the commands for the hands-on exercises!! Typing (and eventually making typos) is an essential part of the learning process.

Tasks

1. List the configured software

The following command is used to check which software is currently configured/set up in your shell environment:

$ module list
Currently Loaded Modulefiles:
  1) cluster/1.0     3) matlab/R2018b   5) freesurfer/6.0
  2) project/1.0     4) R/3.5.1         6) fsl/6.0.0

Configured software is listed in terms of the loaded modules. You probably noticed a similar message in the terminal after you logged in to the cluster's access node. This message informs you about the pre-loaded environment modules. It implies that your bash shell has been configured with the proper environment variables (e.g. PATH) for running those software versions right away after login.

2. List available software

$ module avail

Environment modules for the software are organised by software name and version.

3. List available versions of R

$ module avail R

You may replace R with matlab, freesurfer or fsl to see versions of different software.

4. Show the changes in environment variables w.r.t. the setup for R version 3.2.2

$ module show R/3.2.2

5. Check current value of the $R_HOME environment variable

$ echo $R_HOME
/opt/R/3.1.2

Since version 3.1.2 is the default R version, the $R_HOME variable points to its installation directory.

6. Set up the environment for R version 3.2.2

Firstly, unload the default R with:

$ module unload R

Then load the specific R version with:

$ module load R/3.2.2


After that, check the $R_HOME variable again; it should now point to the directory where version 3.2.2 is installed. You should then be ready to use R version 3.2.2 in the cluster.

$ echo $R_HOME

7. Don’t like 3.2.2 and want to switch to 3.3.1 . . . Do you know how to do it?

Exercise: distributed data analysis with MATLAB

In this exercise, you will learn how to submit MATLAB jobs to the cluster using two approaches that are commonly used at DCCN. The first approach is to use a wrapper script called matlab_sub; the second is to submit batch jobs right from within the graphical interface of MATLAB.

Note: In this exercise, we will use commands in MATLAB and in the Linux shell. When a command starts with the prompt $, it is a command for the Linux shell. If it starts with >>, it is a command to be typed into a MATLAB console.

Preparation

Follow the steps below to download the prepared MATLAB scripts.

$ wget https://github.com/Donders-Institute/hpc-wiki-v2/raw/master/docs/cluster_howto/exercise_matlab/matlab_exercise.tgz
$ tar xvzf matlab_exercise.tgz
$ ls
matlab_sub  qsub_toolbox

Task 1: matlab_sub

When you have a MATLAB script file (i.e. an M-file) which takes no input arguments, you can simply submit a job to run the script using the matlab_sub command. In this task, you are given an M-file which generates an 8x8 magic matrix, sums its diagonal elements, and finally saves the sum to a file. Follow the steps below for the exercise:

1. Switch to the working directory in which the M-file is provided

$ cd matlab_sub
$ ls
magic_cal.m

2. Read and understand the magic_cal.m script

3. (Optional) Choose a desired MATLAB version, e.g. R2014b

$ module unload matlab
$ module load matlab/R2014b

As long as you are fine with the default version of MATLAB, you can leave this step out. The default version of MATLAB can be checked with:


$ module avail matlab

4. Submit a job to run the script

$ matlab_sub magic_cal.m

You will be asked to provide the walltime and memory requirements of the job.

Tip: You can bypass the interaction of providing memory and walltime requirements by using the --mem and --walltime options of the matlab_sub script. The example below submits a job requesting resource of 4 GB memory and 1 hour walltime.

$ matlab_sub --walltime 01:00:00 --mem 4gb magic_cal.m

5. Monitor the job until it is finished. You will see the output file magic_cal_output.mat containing the result.

Task 2: qsubcellfun

1. Start a MATLAB interactive session with the command

$ matlab2014a

2. In the MATLAB graphical interface, type the following command to load the MATLAB functions for submitting jobs to the cluster. These functions are part of the FieldTrip toolbox.

>> addpath '/home/common/matlab/fieldtrip/qsub'

3. Switch to the working directory in which the prepared MATLAB functions are located. For example,

>> cd qsub_toolbox
>> ls
qsubcellfun_demo.m  qsubfeval_demo.m  qsubget_demo.m  randn_aft_t.m

4. Open the file randn_aft_t.m. This MATLAB function keeps refreshing an n-dimensional array for a given duration. It takes two arguments: n for the array dimension, and t for the duration. You could try to run it interactively using the MATLAB commands below:

>> n_array = {10, 10, 10, 10, 10};
>> t_array = {30, 30, 30, 30, 30};
>> out = cellfun(@randn_aft_t, n_array, t_array, 'UniformOutput', false);
>> out

out =

Columns 1 through 4

[10x10 double] [10x10 double] [10x10 double] [10x10 double]

Column 5

[10x10 double]


5. The cellfun call above iterates sequentially five times over the randn_aft_t function. In every iteration, it calls the function with n=10 and t=30. Using the cluster, the iterations can be run in parallel via the qsubcellfun function. For example,

>> out = qsubcellfun(@randn_aft_t, n_array, t_array, 'memreq', 10*10*8, 'timreq', 30, 'stack', 1);

Note: The qsubcellfun will block the MATLAB console until all submitted jobs are finished.

Task 3: qsubfeval

An alternative way of running MATLAB functions in batch is to use the qsubfeval function. In fact, qsubfeval is the underlying function called by qsubcellfun for creating and submitting each individual job. Follow the steps below to run the same randn_aft_t function using qsubfeval.

1. Start a MATLAB interactive session with the command

$ matlab2014a

2. In MATLAB, load the qsub toolbox from FieldTrip.

>> addpath '/home/common/matlab/fieldtrip/qsub'

3. Switch to the working directory in which the prepared MATLAB functions are located. For example,

>> cd qsub_toolbox
>> ls
jobmon_demo.m  qsubcellfun_demo.m  qsubfeval_demo.m  qsubget_demo.m  randn_aft_t.m

4. Submit batch jobs to run the randn_aft_t function, using qsubfeval.

>> n_array = {2, 4, 6, 8, 10};
>> t_array = {20, 40, 60, 80, 100};
>> jobs = {};
>>
>> for i = 1:5
       req_mem   = n_array{i} * n_array{i} * 8;
       req_etime = t_array{i};
       jobs{i}   = qsubfeval(@randn_aft_t, n_array{i}, t_array{i}, 'memreq', req_mem, 'timreq', req_etime);
   end
>>
>> save 'jobs.mat' jobs

Each call of qsubfeval submits a job to run on one pair of n (array dimension) and t (duration). For this reason, we have to do the iteration ourselves, using the for loop. This is different from using qsubcellfun. Another difference is that the MATLAB prompt is not blocked after job submission. A benefit is that we can continue with other MATLAB commands without waiting for the jobs to finish. However, we need to save references to the submitted jobs in order to retrieve their results later. In the example above, the job references are stored in the array jobs. You may also save the references to a file and leave MATLAB completely.

5. You probably noticed that the job reference returned by qsubfeval is not the Torque job id. The qsublist function is provided to map the job reference to the Torque job id. We can combine this function with a system call to the qstat command to query the job status. For example:


>> load 'jobs.mat'
>>
>> for j = jobs
       jid = qsublist('getpbsid', j);
       cmd = sprintf('qstat %s', jid);
       unix(cmd);
   end

6. When all jobs are finished, one can retrieve the output using qsubget. For example,

>> load 'jobs.mat'
>>
>> out = {};
>>
>> for j = jobs
       out = [out, qsubget(j{:})];
   end
>>
>> out

Note: After the output is loaded into MATLAB with the qsubget function, the output file is removed from the file system. If you need to reuse the output data in the future, it is better to save it to a .mat file before you close MATLAB.

Exercise: Running FreeSurfer jobs on the cluster

In this exercise we will construct a small script to run FreeSurfer’s recon-all, and use qsub to submit this script to the cluster for execution.

Preparation

Move into the directory you’d like to work in and download the files prepared for the exercise using this command:

$ wget https://github.com/Donders-Institute/hpc-wiki-v2/raw/master/docs/cluster_howto/exercise_freesurfer/FSdata.tgz
$ tar -xvf FSdata.tgz
$ cd FSdata

Task 1: create the script

1. Open a text editor and create the script called runFreesurfer.sh

#!/bin/bash

export SUBJECTS_DIR=$(pwd)

recon-all -subjid FreeSurfer -i MP2RAGE.nii -all

2. Set the script to be executable

3. Load the freesurfer module (using version 5.3 as an example)


$ module unload freesurfer
$ module load freesurfer/5.3

4. Submit the script to the cluster

$ echo "cd $PWD; ./runFreesurfer.sh" | qsub -l walltime=00:10:00,mem=1GB

5. Verify the job is running with qstat. You should see something like:

$ qstat 11173851
Job ID                Name     User       Time Use  S  Queue
--------------------  -------  ---------  --------  -  -----
11173851.dccn-l029    STDIN    dansha0              Q  long

6. Because we don’t really want to run the analysis but rather test a script, kill the job with qdel. For example:

$ qdel 11173851

Exercise: running python in the cluster

In this exercise, you will learn how to run a Python script in the cluster, using Anaconda and the conda environment.

Preparation

Follow the steps below to download the prepared Python scripts.

$ wget https://github.com/Donders-Institute/hpc-wiki-v2/raw/master/docs/cluster_howto/exercise_python/python_exercise.tgz
$ tar xvzf python_exercise.tgz
$ ls
example4d.nii.gz  nibabel_example.py

Let’s run the python script, and you should expect some errors as this script requires a python module called nibabel.

$ python nibabel_example.py
Traceback (most recent call last):
  File "nibabel_example.py", line 3, in <module>
    import nibabel as nib
ImportError: No module named nibabel

Task 1: Conda environment

Load the anaconda module using the command below:

$ module load anaconda2/4.3.0

Then check which python executable is used, e.g.

$ which python
/opt/anaconda2/4.3.0/bin/python

While Anaconda provides a bundle of ready-to-use python packages for data analysis, the conda environment is useful in two perspectives:


1. It creates isolation between Python projects, so that requirements and package dependencies in one environment do not spoil other environments.
2. It allows users to install packages without administrative permission.

After the anaconda module is loaded, use the command below to create a conda environment called demo, with the pip, jupyter and numpy packages installed right away.

$ conda create --name demo pip jupyter numpy

At the end of the creation, example commands for activating and deactivating the environment will be given on the terminal. To activate the environment we just created, do:

$ source activate demo

After that you will see changes on the shell prompt. For example, the name demo is shown on the terminal prompt. Now check which python or pip program you will be using:

$ which python
~/.conda/envs/demo/bin/python

$ which pip
~/.conda/envs/demo/bin/pip

You see that the python and pip programs are now located in your home directory, under the conda environment directory we just created. The conda environment settings in the shell are transferred with the jobs you submit to the cluster. You can check that by starting an interactive job and checking the locations of the python and pip programs: they should still point to your home directory under the conda environment.

$ qsub -I -l 'walltime=00:20:00,mem=1gb'

$ which python
~/.conda/envs/demo/bin/python

$ which pip
~/.conda/envs/demo/bin/pip

Tip: You may also first submit a job and then activate the conda environment after the job starts. This may be handy when the conda environment is only needed within the scope of the job, or when you want to switch between conda environments for different jobs.
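A minimal sketch of this job-first workflow, reusing the demo environment from Task 1:

$ qsub -I -l 'walltime=00:20:00,mem=1gb'
# ... wait for the interactive job to start on a compute node, then:
$ source activate demo
$ which python
~/.conda/envs/demo/bin/python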

To deactivate the environment, do:

$ source deactivate demo

Tip: To deactivate the conda environment, you may also close the terminal in which the conda environment is loaded.

Task 2: Python packages

Let’s activate the conda environment we just created in Task 1.


$ source activate demo

Within a conda environment, you can install your own packages if the pip package is available in that environment. Use the following command to check whether pip is available in the environment:

$ which pip
~/.conda/envs/demo/bin/pip

The output of the command above should be a path starting with ~/.conda. Try to install a package called nibabel in your conda environment, using the command below:

$ pip install nibabel

Note: The conda environment is created and installed in your home directory under the path $HOME/.conda/ envs. Environments are organised in different subfolders. When you install new packages in an environment, relevant files will also be created in its own subfolder. Be aware of the fact that conda environments do take space from the quota of your home directory.
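As conda environments count against your home quota, you may occasionally want to check how much space an environment occupies; for the demo environment created above, a simple check is:

$ du -sh ~/.conda/envs/demo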

Once the installation is done, let’s run the python script in the downloaded tarball again, and it should work.

$ python nibabel_example.py
(128, 96, 24, 2)

Task 3: Jupyter notebook

Make sure you are in the conda environment we created in Task 1; otherwise, run the following command:

$ source activate demo

Jupyter Notebook is a web application for creating and sharing documents containing live (Python) code. In order to run the live Python code within a conda environment (so that you can access all Python libraries installed in that environment), the package jupyter should also be installed in the conda environment. Use the following command to check this.

$ conda list | grep jupyter
jupyter           1.0.0   py27_3
jupyter_client    5.1.0   py27_0
jupyter_console   5.2.0   py27_0
jupyter_core      4.3.0   py27_0

If you don’t see jupyter related packages in your conda environment, run the following command to install it

$ conda install jupyter

Within the conda environment, simply run the command jupyter-notebook to start the Jupyter notebook. Try to run the python script nibabel_example.py again in the notebook. It should just work.

Task 4: Spyder

Spyder is an integrated development environment (IDE) for Python projects. It is part of the Anaconda package.


If you just want to use the Spyder IDE without loading specific Python modules from your own conda environment, you can simply run the following command on a cluster access node within a VNC session:

$ spyder

You will see a graphical dialog through which you can select Spyder from a specific Anaconda version. The wrapper then submits a job to the cluster to launch that Spyder version on a compute node.

If you want to use specific modules installed in a conda environment, you have to install your own Spyder in the same conda environment. Using the demo conda environment as an example, here are the steps to follow:

Make sure you are in the conda environment we created in Task 1; otherwise, run the following command:

$ source activate demo

Install the Spyder package, using the conda install command:

Important: DO NOT install spyder via pip install. The spyder installed via pip does not take care of library dependencies and is therefore very likely to be broken.

$ conda install spyder

Submit an interactive job with your required resource, e.g.:

$ qsub -I -l walltime=1:00:00,mem=4gb

Under the shell prompt of the interactive job, run the following commands to start Spyder:

$ source activate demo
$ spyder

You can now check within the Spyder IDE whether the nibabel Python module we installed earlier is still available. For instance, open the file nibabel_example.py in Spyder and press the F5 key (or select Run from the menu). This should show the result in the IPython console (at the bottom-right of the Spyder IDE).

Exercise: distributed data analysis with R

In this exercise, you will learn how to submit R jobs to the cluster using Rscript, the scripting front-end of R. The exercise is divided into two tasks. The first task gets you familiar with the flow of running an R script as a batch job in the HPC cluster. The second is more about bookkeeping the outputs (R data files) produced by R jobs running concurrently in the cluster.

Note: In this exercise, we will use commands in R and in the Linux shell. When a command starts with the prompt $, it is a command for the Linux shell. If it starts with >, it is a command to be typed into an R console.

R packages

See r-packages.


Preparation

Follow the steps below to download the prepared R scripts.

$ mkdir R_exercise
$ cd R_exercise
$ wget https://github.com/Donders-Institute/hpc-wiki-v2/raw/master/docs/cluster_howto/exercise_R/R_exercise.tgz
$ tar xvzf R_exercise.tgz
$ ls
magic_cal_2.R  magic_cal_3.R  magic_cal.R

Load environment for R version 3.2.2.

$ module unload R
$ module load R/3.2.2
$ which R
/opt/R/3.2.2/bin/R

Task 1: simple job

In this task, we use the script magic_cal.R. This script uses the magic library to generate a magic matrix of a given dimension and calculates the sum of its diagonal elements. The matrix and the sum are both printed to the standard output.

1. run the script interactively, for a matrix of dimension 5

$ export R_LIBS=/opt/R/packages
$ Rscript magic_cal.R 5
WARNING: ignoring environment value of R_HOME
Loading required package: abind
     [,1] [,2] [,3] [,4] [,5]
[1,]    9    2   25   18   11
[2,]    3   21   19   12   10
[3,]   22   20   13    6    4
[4,]   16   14    7    5   23
[5,]   15    8    1   24   17
[1] 65

2. read and understand the magic_cal.R script

3. run the script on the cluster as a batch job

$ echo "Rscript $PWD/magic_cal.R 5" | qsub -N "magic_cal" -l walltime=00:10:00,mem=256mb
11082769.dccn-l029.dccn.nl

4. wait for the job to finish, and check the output of the job. Do you get the same results as when running interactively?

5. run five batch jobs in parallel to run magic_cal.R with matrices of dimensions 5, 6, 7, 8 and 9.

$ for d in {5..9}; do echo "Rscript $PWD/magic_cal.R $d" | qsub -N "magic_cal_$d" -l walltime=00:10:00,mem=256mb; done


Task 2: job bookkeeping and saving output objects

In the previous task, data objects were just printed to the standard output and consequently captured as text in the output files of the jobs. Data stored in this way can hardly be reused in subsequent analyses. A better approach is to store the objects in an R data file (i.e. an RData file), using the save function of R.

Given that batch jobs in the cluster may execute at the same time, writing objects from different jobs into the same file is not recommended, as concurrency issues may result in corrupted output. A better approach is to write the outputs of each job to a separate file. This implies that running batch jobs in parallel requires an additional bookkeeping strategy for the jobs as well as for the output files they produce.

In this exercise, we are going to use the script magic_cal_2.R, in which functions are provided to

• save objects into a data file, and
• get job/process information that can be used for bookkeeping purposes.

Follow the steps below:

1. run the script interactively

$ Rscript magic_cal_2.R 5
WARNING: ignoring environment value of R_HOME
Loading required package: abind
saving objects magic_matrix,sum_diagonal to magic_cal_2.out.RData ... done

From the terminal output, you see that two objects are saved into an RData file called magic_cal_2.out.RData. Later on, you can load the objects from this file into R or an R script. For example,

> load("magic_cal_2.out.RData")
> ls()
[1] "magic_matrix" "sum_diagonal"
> magic_matrix
     [,1] [,2] [,3] [,4] [,5]
[1,]    9    2   25   18   11
[2,]    3   21   19   12   10
[3,]   22   20   13    6    4
[4,]   16   14    7    5   23
[5,]   15    8    1   24   17
> q(save="no")

2. read and understand the magic_cal_2.R script, especially the functions at the top of the script.

3. try to run magic_cal_2.R as batch jobs, as we did in the previous task.

Tip: You probably noticed that the functions defined in magic_cal_2.R are so generic that they can be reused in different scripts. That is right! In fact, we have factored out those functions into /opt/cluster/share/R so that you can easily make use of them in the future. The script magic_cal_3.R shows you how to load those functions in your R scripts. It also shows how to construct the name of the RData file using the job information.

2.4.5 Exercises


Exercise: interactive job

In this exercise, you will start an interactive job in the Torque cluster. When the interactive job starts, check the hostname of the compute node on which your interactive job runs.

Tasks

Note: DO NOT just copy-n-paste the commands for the hands-on exercises!! Typing (and eventually making typos) is an essential part of the learning process.

1. submit an interactive job with the following command and wait for the job to start.

$ qsub -I -N 'MyFirstJob' -l 'walltime=00:30:00,mem=128mb'

2. note the prologue message when the job starts.

3. check the hostname of the compute node with the command below:

$ hostname -f
dccn-c012.dccn.nl

4. try a few Linux commands in this shell, e.g. ls, cd, etc.

Tip: In the interactive session, it is just like working in a Linux shell.

5. terminate the job with the exit command

$ exit

After that, you should get back to the Linux shell on the access node where your job was submitted.

Exercise: simple batch job

The aim of this exercise is to get you familiar with the Torque client tools for submitting and managing cluster jobs. We will first create a script that calls the sleep command for a given period of time. After that, we will submit the script as jobs to the cluster.

Tasks

Note: DO NOT just copy-n-paste the commands for the hands-on exercises!! Typing (and eventually making typos) is an essential part of the learning process.

1. make a script called run_sleep.sh with the following content:

#!/bin/bash

my_host=$( /bin/hostname )

time=$( date )
echo "$time: $my_host falls asleep ..."

sleep $1

time=$( date )
echo "$time: $my_host wakes up."

Note: The input arguments of a bash script are accessible via the variables $n, where n is an integer referring to the n-th argument given to the script. In the script above, the value $1 on the line sleep $1 refers to the first argument given to the script. For instance, if you run the script as run_sleep.sh 10, the value of $1 is 10.
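As a minimal illustration of positional arguments (not part of the exercise files; args_demo.sh is a hypothetical name):

#!/bin/bash
# print the first and the second argument given to the script
echo "first argument : $1"
echo "second argument: $2"

Running ./args_demo.sh 10 20 would print 10 and 20, respectively.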

2. make sure the script runs locally

$ chmod +x run_sleep.sh
$ ./run_sleep.sh 1
Mon Sep 28 16:36:28 CEST 2015: dccn-c007.dccn.nl falls asleep ...
Mon Sep 28 16:36:29 CEST 2015: dccn-c007.dccn.nl wakes up.

3. submit a job to run the script

$ echo "$PWD/run_sleep.sh 60" | qsub -N 'sleep_1m' -l 'nodes=1:ppn=1,mem=10mb,walltime=00:01:30'
6928945.dccn-l029.dccn.nl

4. check the job status. For example,

$ qstat 6928945

Note: The torque job id given here should be replaced accordingly.

5. or monitor it until it is complete

$ watch qstat 6928945

Tip: The watch command is used here to repeat the qstat command every 2 seconds. Press Control-c to quit the watch program when the job is finished.

6. examine the output file, e.g. sleep_1m.o6928945, and find out the resource consumption of this job

$ cat sleep_1m.o6928945 | grep 'Used resources'
Used resources:       cput=00:00:00,mem=4288kb,vmem=433992kb,walltime=00:01:00

7. submit another job to run the script, with a longer sleep duration. For example,

$ echo "$PWD/run_sleep.sh 3600" | qsub -N 'sleep_1h' -l 'nodes=1:ppn=1,mem=10mb,walltime=01:10:00'
6928946.dccn-l029.dccn.nl


Note: Compare this with the command in step 3. As we expect the job to run longer, the requested job walltime is also extended, to 1 hour and 10 minutes.

8. Ok, we don’t want to wait for the 1-hour job to finish. Let’s cancel the job. For example,

$ qdel 6928946

Exercise: finding resource requirement

In this exercise, you will use two different ways to estimate the resource requirements of running a "fake" application. We will focus on estimating the memory requirement, as it has a significant impact on how efficiently the cluster resources are utilised.

Preparation

Download the "fake" application, which performs memory allocation and random number generation. At the end of the computation, the fake application also produces the cube of a given integer (i.e. n^3). Follow the commands below to download the fake application and run it locally:

$ wget https://github.com/Donders-Institute/hpc-wiki-v2/raw/master/docs/cluster_howto/exercise_resource/fake_app
$ chmod +x fake_app
$ ./fake_app 3 1

compute for 1 seconds
result: 27

The first argument (i.e. 3) is the base of the cube. The second argument (i.e. 1) specifies the duration of the computation in seconds. Although the result looks trivial, the program internally generates CPU time and memory usage. The CPU time is clearly determined by the second input argument. The question here is the amount of memory needed to run this program.

Task 1: with the JOBinfo monitor

In the first task, you will estimate the amount of memory required by the fake application, using a resource-utilisation monitor.

1. Start a VNC session (skip this step if you are already in a VNC session)

2. Submit an interactive job with the following command

$ qsub -I -l walltime=00:30:00,mem=1gb

When the job starts, a small JOBinfo window pops up at the top-right corner.

3. Run the fake application under the shell prompt initiated by the interactive job

$ ./fake_app 3 60


Keep your eyes on the JOBinfo window and see how the memory usage evolves. The Max memory usage indicates the amount of memory needed for the fake application.

4. Terminate the interactive job

Task 2: with job’s STDOUT/ERR file

In this task, you will be confronted with a situation in which the computing resource (in this case, the memory) allocated for your job is not sufficient to complete the computation. After a few trials, you will find a sufficient (but not overestimated) memory requirement to finish the job.

1. Download another fake application

$ wget https://github.com/Donders-Institute/hpc-wiki-v2/raw/master/docs/cluster_howto/exercise_resource/fake_app_2
$ chmod +x fake_app_2

2. Try to submit a job to the cluster using the following command.

$ echo "$PWD/fake_app_2 3 300" | qsub -N fake_app_2 -l walltime=600,mem=128mb

3. Wait for the job to finish, and check the STDOUT and STDERR files of the job. Do you get the expected result in the STDOUT file?

4. In the STDOUT file, find the relevant information concerning the job running out of its memory limit in the Epilogue section. In the example below, this information is presented on lines 4, 9 and 10. Line 4 shows that the job's exit code is 137. This is the first hint that the job might have been killed by the system kernel due to memory over-use. Line 9 shows the memory requirement specified at job submission time, while line 10 shows that the maximum memory used by the job is 134217728 bytes, which is very close to the 128mb in the requirement (i.e. the "asked resources"). Putting this information together, what happened behind the scenes is that the job got killed by the kernel when the computational process (fake_app_2 in this case) tried to allocate more memory than was requested for the job. The kill caused the process to return an exit code of 9; the Torque scheduler translated this into the job's exit code by adding an extra 128 to the process' exit code (a small sanity-check sketch follows the listing below).

 1  ----------------------------------------------------------------
 2  Begin PBS Epilogue Wed Oct 17 10:18:53 CEST 2018 1539764333
 3  Job ID:           17635280.dccn-l029.dccn.nl
 4  Job Exit Code:    137
 5  Username:         honlee
 6  Group:            tg
 7  Job Name:         fake_app_2
 8  Session:          15668
 9  Asked resources:  walltime=00:10:00,mem=128mb
10  Used resources:   cput=00:00:04,walltime=00:00:19,mem=134217728b
11  Queue:            veryshort
12  Nodes:            dccn-c365.dccn.nl
13  End PBS Epilogue Wed Oct 17 10:18:53 CEST 2018 1539764333
14  ----------------------------------------------------------------
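As a quick sanity check of the exit-code arithmetic described above, the relation between the job exit code and the terminating signal can be verified in any Linux shell; the output file name below follows the <jobname>.o<jobid> convention used elsewhere in this wiki and is an assumption here.

$ grep 'Job Exit Code' fake_app_2.o17635280
Job Exit Code:    137
$ echo $(( 137 - 128 ))   # the remainder is the signal number: 9 (SIGKILL)
9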

5. Try to submit the job again with the memory requirement increased sufficiently for the actual usage.

Tip: Specify the requirement higher than, but as close as possible to, the actual usage. An unnecessarily high requirement results in inefficient usage of resources, and consequently blocks other jobs (including yours) from having sufficient resources to start.

Exercise: distributed data analysis in the Torque cluster

This exercise mimics a distributed data analysis in which we have to apply the same data analysis algorithm independently to the datasets collected from 6 subjects. We will use the Torque cluster to run the analyses in parallel.

Preparation

Use the commands below to download the exercise package and check its content.

$ wget https://github.com/Donders-Institute/hpc-wiki-v2/raw/master/docs/cluster_howto/exercise_da/torque_exercise.tgz
$ tar xvzf torque_exercise.tgz
$ cd torque_exercise
$ ls
run_analysis.sh  subject_0  subject_1  subject_2  subject_3  subject_4  subject_5

In the package, there are folders for the subject data (i.e. subject_{0..5}). In each subject folder, there is a data file containing an encrypted string (URL) pointing to the subject's photo. In this fake analysis, we are going to find out who our subjects are, using a trivial "analysis algorithm" that performs the following two steps in each subject folder:

1. decrypting the URL string, and
2. downloading the subject's photo.

The analysis algorithm has been provided as a function in the BASH script run_analysis.sh.

Tasks

1. (optional) read the script run_analysis.sh and try to get an idea how to use it. Don’t spend too much time in understanding every detail.

Tip: The script consists of a BASH function (analyze_subject_data) encapsulating the data-analysis algorithm. The function takes one input argument, the subject id. In the main program (the last line), the function is called with an input $1. In BASH, variable $1 is used to refer to the first argument of a shell command.

2. run the analysis interactively on the dataset of subject_0

$ ./run_analysis.sh 0

The command doesn’t return any output to the terminal. If it is successfully executed, you should see a photo in the folder subject_0.

Tip: The script run_analysis.sh is written to take one argument, the subject id. Thus the command above performs the data analysis algorithm on the dataset of subject_0 interactively.


3. run the analysis by submitting 5 parallel jobs; each runs on a dataset.

Tip: The command seq 1 N is useful for generating a list of integers between 1 and N. You could also use {1..N} as an alternative.

4. wait until the jobs finish and check out who our subjects are. You should see a file photo.* in each subject’s folder.

Solution

1. a complete version of the run_analysis.sh:

#!/bin/bash

## This function mimics data analysis on a subject dataset.
## - It takes the subject id as argument.
## - It decrypts the data file containing an encrypted URL to the subject's photo.
## - It downloads the photo of the subject.
##
## To call this function, use
##
##   analyze_subject_data <subject_id>
function analyze_subject_data {

    ## get subject id from the argument of the function
    id=$1

    ## determine the root directory of the subject folders
    if [ -z $SUBJECT_DIR_ROOT ]; then
        if [ -z $PBS_O_WORKDIR ]; then
            SUBJECT_DIR_ROOT=$PWD
        else
            SUBJECT_DIR_ROOT=$PBS_O_WORKDIR
        fi
    fi

    subject_data="${SUBJECT_DIR_ROOT}/subject_${id}/data"

    ## data decryption password
    decrypt_passwd="dccn_hpc_tutorial"

    if [ -f $subject_data ]; then

        ## decrypt the data and get URL to the subject's photo
        url=$( openssl enc -aes-256-cbc -d -in $subject_data -k $decrypt_passwd )

        if [ $? == 0 ]; then

            ## get the file suffix of the photo file
            ext=$( echo $url | awk -F '.' '{print $NF}' )

            ## download the subject's photo
            wget --no-check-certificate $url -o ${SUBJECT_DIR_ROOT}/subject_${id}/log -O ${SUBJECT_DIR_ROOT}/subject_${id}/photo.${ext}

            return 0

        else
            echo "cannot resolve subject data url: $subject_data"
            return 1
        fi

    else
        echo "data file not found: $subject_data"
        return 2
    fi
}

## The main program starts here
## - make this script take the subject id as its first command-line argument
## - call the data analysis function given above with the subject id as the argument

analyze_subject_data $1

2. submit jobs to the torque cluster

$ for id in $( seq 1 5 ); do echo "$PWD/run_analysis.sh $id" | qsub -N "subject_$id" -l walltime=00:20:00,mem=1gb; done

2.5 The project storage

Research at DCCN is organised in projects. Research data associated with projects is centrally organised on the project storage. Each project receives a specific directory on the project storage, which is accessible on the HPC cluster via the path /project/<project number>. For example, the storage for project 3010000.01 is under the path /project/3010000.01.

2.5.1 Managing access permission of project data

Data sharing within the project directory is controlled by a role-based mechanism implemented around the NFSv4 Access Control List technology.

User roles

In the project storage, the access permission of a user is governed by the user's role in the project. There are four roles defined for access control. They are listed below:


role          permissions
Viewer        User in this role has read-only permission.
Contributor   User in this role has read and write permission.
Manager       User in this role has read and write permission, and the right to grant/revoke roles of other users.
Traverse      User in this role has permission to "pass through" a directory. This role is only relevant for a directory. It is similar to the x-bit of the Linux filesystem permission. See the usage of the traverse role.

Any user who wants to access data in a project directory must acquire one of the roles in the project. Users in the Manager role can grant/revoke user roles.

Tool for viewing access permission

For general end-users, a tool called prj_getacl (as Project Get ACL) is used to show user roles of a given project. For example, to list the user roles of project 3010000.01, one does

$ prj_getacl 3010000.01
/project/3010000.01/:
    manager: honlee
    contributor: martyc
    viewer: edwger
    traverse: mikveng

One can also apply the prj_getacl program to a path (file or directory) in the project storage. For example,

$ prj_getacl /project/3010000.01/rdm-test
/project/3010000.01/rdm-test/:
    manager: honlee
    contributor: martyc
    viewer: mikveng,edwger

Note: • The name prj_getacl should be taken as “Project Get ACL”; thus the last character of it should be the lower-case of the letter L. • Use the -h option to see additional options supported by prj_getacl.

Tool for managing access permission

For the project manager, the tool prj_setacl (read as Project Set ACL) is used for altering user roles of a project. For example, to change the role of user rendbru from Contributor to Viewer on project 3010000.01, one does

$ prj_setacl -u rendbru 3010000.01

Note: The name prj_setacl should be taken as “Project Set ACL”; thus the last character of it should be the lower-case of the letter L.

Similarly, to set rendbru back to the Contributor role, one uses the following command:


$ prj_setacl -c rendbru 3010000.01

To promote rendbru to the Manager role, one uses the -m option then, e.g.

$ prj_setacl -m rendbru 3010000.01

To remove a user's access to a project, another tool, prj_delacl (read as Project Delete ACL), is used. For example, to remove the access rights of rendbru from project 3010000.01, one does

$ prj_delacl rendbru 3010000.01

Note: The name prj_delacl should be taken as “Project Delete ACL”; thus the last character of it should be the lower-case of the letter L.

Changing access permission for multiple users

When changing/removing roles for multiple users, it is more efficient to combine the changes into one single prj_setacl or prj_delacl command, as it requires only one loop over all existing files in the project directory. The options -m (for manager), -c (for contributor) and -u (for viewer) can be used at the same time in one prj_setacl call. Furthermore, multiple users to be set to (or removed from) the same role can be specified as a comma(,)-separated list with the prj_setacl and prj_delacl tools. For example, the following single command will set both honlee and rendbru as contributors, and edwger as viewer, of project 3010000.01:

$ prj_setacl -c honlee,rendbru -u edwger 3010000.01

The following single command will remove both honlee and edwger from project 3010000.01:

$ prj_delacl honlee,edwger 3010000.01

Controlling access permission on sub-directories

It is possible to set/delete a user role on a sub-directory within a project directory. This is done by using either the -p option, or by directly specifying the absolute path of the directory. Both the prj_setacl and prj_delacl programs support this. When doing so, the user is automatically granted (or revoked) the traverse role on the parent directories if the user does not already have a role on them. For example, granting user edwger the contributor role on the sub-directory subject_001 in project 3010000.01 can be done as below:

$ prj_setacl -p subject_001 -c edwger 3010000.01

Alternatively, one could also do:

$ prj_setacl -c edwger /project/3010000.01/subject_001

If it happens that the user edwger doesn't have any role in the directory /project/3010000.01, edwger is also automatically granted the traverse role on /project/3010000.01. This is necessary for edwger to "traverse through" it in order to access the subject_001 sub-directory.


Note: In this situation, user edwger has to specify the directory /project/3010000.01/subject_001 or P:\3010000.01\subject_001 manually in the file explorer to access the sub-directory. This is due to the fact that a user with only the traverse role cannot see any content (files or directories, including those to which the user has access permission) in the directory.

The Traverse role

When granting a user a role on a sub-directory, a minimum permission on the upper-level directories should also be given to the user so that they can "pass through" the directory tree. This minimum permission is referred to as the Traverse role. The traverse role is automatically managed by the prj_setacl and prj_delacl programs when managing access to a sub-directory or a file within a project directory. See Controlling access permission on sub-directories.

2.6 Linux & HPC workshops

Regular workshops are held by the TG. Hereafter are the agendas and presentations of past workshops.

2.6.1 2020 - 2021

• [2020.12.04] Access HPC from home
  – slides
  – video recording with timestamps for different topics:
    * HPC environment at DCCN
    * create new VNC server/session (Windows, MacOSX)
    * request eduVPN Trigon access
    * eduVPN client installation (Ubuntu Linux, Windows, MacOSX)
    * connect to VNC server with eduVPN (Windows, MacOSX)
    * SSH tunneling
    * connect to VNC server with SSH tunneling (Windows, MacOSX)
  – further reading: Access from outside of DCCN

• [2020.12.18] Linux and BASH scripting
  – slides
    * introduction
    * Linux basic
    * BASH scripting
  – video recording with timestamps for different topics:
    * Linux basics
    * Data transfer between laptop/desktop at home and the HPC cluster
    * Bash scripting
    * Closing remarks
  – further reading:
    * Linux tutorial
    * Introduction to the Linux BASH shell
    * Linux/BASH cheatsheet
    * Data transfer to/from the HPC cluster with WinSCP

• [2021.01.15] BASH scripting (recap) and HPC job management
  – slides:
    * BASH scripting
    * HPC job management
  – video recording with timestamps for different topics:
    * recap on BASH scripting
    * HPC job management
      · demo: interactive job
      · demo: batch job
      · trouble shooting
      · Tips for resource estimation
    * Access application with (env) modules
  – further reading:
    * Introduction to the Linux BASH shell
    * Linux/BASH cheatsheet
    * Running computations on the Torque cluster

• [2021.01.29] Data management and Git/Git-annex
  – slides:
    * Data storage and management
    * Git tutorial

2.6.2 Earlier workshops

Note: Workshop agenda and slides below are only accessible within the DCCN network.
