Supercomputing
Accessing CESGA Finis Terrae II

Finis Terrae II

- 4x Login nodes:
  - 2x Intel Xeon E5-2680v3 (12c/24t)
  - 128 GB RAM
  - 2x 1 TB local drives
- 306x Thin nodes (as login nodes, plus):
  - 1 TB local drive
- 4x GPU nodes (as login nodes, plus):
  - 2x NVIDIA Tesla K80 GPUs
- 2x Xeon Phi nodes (as login nodes, plus):
  - 2x Intel Xeon Phi 7120P (61c)
- 1x FAT node:
  - 8x Intel Xeon E7-8867v3 (16c/32t)
  - 4 TB RAM
  - ~30 TB SAS storage

All nodes are interconnected with Infiniband FDR @ 56 Gbps and share a Lustre parallel filesystem; Internet access goes through the login nodes.

First time connection

# First, let's create a public/private key pair to avoid typing the password every time
$ ssh-keygen
Generating public/private rsa key pair.
Enter file in which to save the key (/home/[my-local-user]/.ssh/id_rsa):
[ ... ]

# Now, authorize this key to access my Finis Terrae II account
$ ssh-copy-id [my-user]@ft2.cesga.es
[my-user]@ft2.cesga.es's password:

# Done! Now let's copy our project to the supercomputer to run it there
$ scp -r myproject/ [my-user]@ft2.cesga.es:~/    # No more password!

# Just reconnect to do our work there
$ ssh [my-user]@ft2.cesga.es
[[my-user]@fs6803 ~]$ cd myproject/
[[my-user]@fs6803 myproject]$ [ ... ]

Note: it is recommended to set a passphrase to protect your private key. Consider using a key agent to keep the unlocked key in memory so you do not have to type the passphrase every time.

Working in a supercomputer

Supercomputers often have a module system to manage their installed software (compilers, libraries, applications, multiple versions...). Compare the system compiler with the one provided by a module:

$ gcc --version
gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-16)

$ module load gcc/6.3.0
$ gcc --version
gcc (GCC) 6.3.0

Other useful commands are: module unload <module>, module avail, module list, module help [<module>]...

Also, be aware of the different storage options available in Finis Terrae II:

Name    Location  Size     Speed               Purpose
Home    $HOME     10 GB    NFS @ ~100 MB/s     General-purpose filesystem. Store code, test results...
Store   $STORE    100 GB   NFS @ ~100 MB/s     Store big final computation results.
Lustre  $LUSTRE   3 TB     Lustre @ ~20 GB/s   High-speed parallel filesystem to store big temporary simulation data.

Queuing jobs

To execute on a compute node you must go through the queue manager. The simplest way is to ask for an interactive session (see compute --help for more details):

[username@fs6803 ~]$ compute
[ ... ]
salloc: Granted job allocation <jobid>
[username@c6601 ~]$ [ ... ]    # Do your compilation or small executions here

To enqueue a simple script, the simplest way is to use the sbatch command. Prepare a simple script:

#!/bin/sh
#SBATCH --ntasks 1              # Tasks (processes) to allocate
#SBATCH --cpus-per-task 8       # Cores per task to allocate (useful to guarantee full node utilization)
#SBATCH --nodes 1               # Nodes to allocate
#SBATCH --partition thinnodes   # Partition (or list of partitions) for the job to run on
#SBATCH --time 00:30:00         # Wall time to allocate for the job (30 minutes)

srun [<srun options>] <your executable>

$ sbatch ./simple_script.sh
Submitted batch job <jobid>
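The same sbatch mechanism can be used to target the GPU nodes. The following is only a minimal sketch, not an official CESGA template: the partition name gpu-shared and the executable name are assumptions, so check sinfo and the CESGA documentation for the exact partition and GRES names on Finis Terrae II.

#!/bin/sh
#SBATCH --job-name gpu-test        # Job name shown by squeue
#SBATCH --ntasks 1                 # One task (process)
#SBATCH --cpus-per-task 12         # Cores for that task
#SBATCH --gres gpu:1               # Ask Slurm for one GPU on the node
#SBATCH --partition gpu-shared     # ASSUMED partition name for the GPU nodes; verify with sinfo
#SBATCH --time 00:30:00            # 30 minutes of wall time

# <your_gpu_executable> is a placeholder for your CUDA/OpenCL binary
srun ./<your_gpu_executable>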
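A job script is also a convenient place to combine the storage areas described above: stage large temporary data on $LUSTRE, run there, and keep only the final results on $STORE. The sketch below uses hypothetical names (input/, my_simulation, results.dat) purely for illustration.

#!/bin/sh
#SBATCH --ntasks 1
#SBATCH --cpus-per-task 8
#SBATCH --partition thinnodes
#SBATCH --time 01:00:00

# Work on the fast parallel filesystem (hypothetical directory and file names)
SCRATCH=$LUSTRE/myproject_run_$SLURM_JOB_ID
mkdir -p "$SCRATCH"
cp -r "$HOME/myproject/input" "$SCRATCH/"
cd "$SCRATCH"

srun "$HOME/myproject/my_simulation" input/    # placeholder executable

# Keep only the final results on $STORE; the Lustre copy is temporary
mkdir -p "$STORE/myproject_results"
cp results.dat "$STORE/myproject_results/run_$SLURM_JOB_ID.dat"
rm -rf "$SCRATCH"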
Slurm

The queue management system is called Slurm, and it is widely used in HPC centers. Every queued job is identified by a unique job ID. The final output of a queued execution will be written in the working directory of the process, in a file with the following name format: slurm-<jobid>.out

To query the status of a queued job, you can use the squeue command:

$ squeue -j <jobid>       # Query a specific job
$ squeue -u <username>    # Query all jobs of a given user

You can also cancel any previously queued job with the scancel command:

$ scancel <jobid>             # Cancel a specific job
$ scancel --state PENDING     # Cancel all your pending jobs

Other useful commands are:

$ scontrol show job <jobid> -dd   # Check details about a queued job
$ sinfo                           # View information about Slurm nodes and partitions
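Putting these commands together, a typical way to follow a job after submitting the script above is sketched below; replace <jobid> with the ID printed by sbatch.

$ sbatch ./simple_script.sh
Submitted batch job <jobid>
$ squeue -j <jobid>                  # Is it still PENDING or already RUNNING?
$ scontrol show job <jobid> -dd      # Inspect details: requested resources, assigned node...
$ tail -f slurm-<jobid>.out          # Follow the output once the job starts
$ scancel <jobid>                    # Cancel it if something went wrong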