Supercomputing

Accessing CESGA Finis Terrae II

4x LOGIN nodes:
- 2x Xeon E5-2680v3 (12c/24t)
- 128GB RAM
- 2x 1TB local drive

306x THIN nodes (as login nodes, plus):
- 1TB local drive

4x GPU nodes (as login nodes, plus):
- 2x NVIDIA Tesla K80 GPUs

2x PHI nodes (as login nodes, plus):
- 2x Intel Xeon Phi 7120P (61c)

1x FAT node:
- 8x Intel Xeon E7-8867v3 (16c/32t)
- 4TB RAM
- ~30TB SAS storage

Interconnect: Infiniband FDR @ 56 Gbps (login nodes also have Internet access)

First time connection

# First, let's create a public/private key pair to avoid using the password every time
$ ssh-keygen
Generating public/private rsa key pair.
Enter file in which to save the key (/home/[my-local-user]/.ssh/id_rsa):
[ ... ] (1)

# Now, authorize this key to access my Finis Terrae II account
$ ssh-copy-id [my-user]@ft2.cesga.es
[my-user]@ft2.cesga.es's password:

# Done! Now let's copy our project to the supercomputer to run it there
$ scp -r myproject/ [my-user]@ft2.cesga.es:~/   # No more password!

# Just reconnect to do our work there
$ ssh [my-user]@ft2.cesga.es
[[my-user]@fs6803 ~]$ cd myproject/
[[my-user]@fs6803 myproject]$ [ ... ]

(1) It is recommended to add a passphrase to protect your private key. Consider using a key agent to keep the unlocked key in memory so you do not have to type the passphrase every time.
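For example, with ssh-agent the unlocked key can be kept in memory for the rest of the session (a minimal sketch; the exact output depends on your system):

# Start an agent (many desktop environments already run one for you)
$ eval "$(ssh-agent -s)"
Agent pid [ ... ]
# Unlock the key once; the agent remembers it for later connections
$ ssh-add ~/.ssh/id_rsa
Enter passphrase for /home/[my-local-user]/.ssh/id_rsa:
Identity added: /home/[my-local-user]/.ssh/id_rsa
# From now on, no passphrase prompt
$ ssh [my-user]@ft2.cesga.es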

Working on a supercomputer

Supercomputers often provide a module system to manage their installed software (compilers, libraries, applications, multiple versions…).

$ gcc --version
gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-16)

vs

$ module load gcc/6.3.0
$ gcc --version
gcc (GCC) 6.3.0

Other useful commands are: module unload, module avail, module list, module help ...
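As an illustration, a short session combining these commands could look like this (a sketch; the list of available modules and the exact output format depend on the site installation):

$ module list
Currently Loaded Modulefiles:
  1) gcc/6.3.0
$ module avail gcc          # List the gcc versions installed on the machine
[ ... ]
$ module unload gcc/6.3.0   # Go back to the system default compiler
$ module help gcc/6.3.0     # Show the help text provided by the module
[ ... ]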

Also, be aware of the different storage options available in Finis Terrae II:

Name     Purpose                                                   Location   Size     Speed
Home     General purpose filesystem. Store code, test results...   $HOME      10 GB    NFS @ ~100 MB/s
Store    To store big final computation results.                   $STORE     100 GB   NFS @ ~100 MB/s
Lustre   High speed parallel filesystem used to store big          $LUSTRE    3 TB     Lustre @ ~20 GB/s
         temporary simulation data.
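A typical way to combine these filesystems, sketched below, is to keep the code in $HOME, stage large temporary data on $LUSTRE while the job runs, and archive only the final results in $STORE (directory and file names here are just an example):

# Code and small files live in the home filesystem
$ cd $HOME/myproject
# Stage big temporary data on the fast parallel filesystem
$ mkdir -p $LUSTRE/myproject
$ cp big_input.dat $LUSTRE/myproject/
# ... run the simulation, writing its intermediate output under $LUSTRE ...
# Keep only the final results in the bigger (but slower) $STORE
$ cp $LUSTRE/myproject/final_results.dat $STORE/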

Queuing jobs

To execute on a compute node, you must go through the queue manager. The simplest way is to ask for an interactive session (see compute --help for more details):

[username@fs6803 ~]$ compute
[ ... ]
salloc: Granted job allocation <jobid>
[username@c6601 ~]$ [ ... ]   # Do your compilation or small executions
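For instance, inside the interactive session you could load a compiler and build and smoke-test your code before queuing longer runs (a sketch; file and program names are just an example):

[username@c6601 ~]$ module load gcc/6.3.0
[username@c6601 ~]$ cd myproject/
[username@c6601 myproject]$ gcc -O2 -o myprogram main.c
[username@c6601 myproject]$ ./myprogram small_test_input
[ ... ]
[username@c6601 myproject]$ exit   # Release the allocation when done
salloc: Relinquishing job allocation <jobid>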

To enqueue a job, the simplest way is to use the sbatch command. Prepare a simple script:

#!/bin/sh
#SBATCH --ntasks 1              # Tasks to allocate (processes)
#SBATCH --cpus-per-task 8       # Cores per task to allocate (useful to guarantee full node utilization)
#SBATCH --nodes 1               # Nodes to allocate
#SBATCH --partition thinnodes   # Partition (or list of partitions) for the job to run on
#SBATCH --time 00:30:00         # Wall time to allocate for the task (30 minutes)

srun [-n <ntasks>] <executable>

$ sbatch ./simple_script.sh
Submitted batch job <jobid>

Slurm

The queue management system is called Slurm, and it is widely used in HPC centers.

All queued jobs are identified by a unique job id. The final output of a queued execution will be written in the working directory, in a file named slurm-<jobid>.out.

To query the status of a queued job, you can use the squeue command:

$ squeue -j <jobid>      # Query a specific job
$ squeue -u <username>   # Query all jobs of a given user
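The output typically looks like the following (columns and widths may vary with the site configuration):

$ squeue -u [my-user]
  JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
[ ... ]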

You can also cancel any previously queued job with the scancel command:

$ scancel <jobid>           # Cancel a specific job
$ scancel --state PENDING   # Cancel all your pending jobs

Other useful commands are:

$ scontrol show job -dd <jobid>   # Check details about a queued job
$ sinfo                          # View information about Slurm nodes
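Putting it all together, a complete batch workflow could look like this sketch (<jobid> stands for the id Slurm assigns at submission time):

$ sbatch ./simple_script.sh
Submitted batch job <jobid>
$ squeue -j <jobid>               # Wait until the job leaves the queue
$ scontrol show job -dd <jobid>   # Inspect details while it is queued or running
$ cat slurm-<jobid>.out           # Read the output once it has finished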