High Performance Computing for Life Sciences

Instructors
• Michael Gribskov
  ○ Biological Sciences and Science
• Glady Andino
  ○ Rosen Center for Advanced Computing (RCAC)
• Nadia Atallah
  ○ Purdue Cancer Center

HPC-Unix 1


Outline

• Purdue compute clusters
• Connecting to HPC clusters
• Windowed vs command line interface (CLI)
  ○ paths
  ○ commands
  ○ redirection, piping
  ○ background
  ○ symbols
• Unix
  ○ Organizing your work
    – directories
    – viewing: pwd, ls, cd
    – making/removing: mkdir, rmdir
    – scratch vs home directory
  ○ Files
    – viewing: cat, more (less)
    – copying/renaming: cp, mv
    – copying between computers: scp, sftp, globus, wget
    – file details: wc, grep
• Using HPC resources
  ○ modules
  ○ Torque/PBS

Purdue HPC computing clusters

[Figure: the Purdue clusters (Scholar, Snyder, Halstead). Each cluster has frontends (4), compute nodes (hundreds), and scratch directories (100 terabytes). Users share home directories (gigabytes) across clusters; scratch directories are cluster specific. Fortress provides 50 petabytes of archival tape storage.]

Purdue HPC computing clusters: Typical Server

[Figure: layout of a typical server, in three parts]
• Interactive/GUI software: MS Word/Excel/PPT, R programming, web browser, games
• Frontends (4): SSH, SFTP, SCP
  ○ transferring files, editing files, program installation
  ○ /home/user: limited disk storage (typically 25 GB)
• Compute nodes (hundreds of nodes): Torque/PBS job scheduler
  ○ used for large compute jobs
  ○ multiple CPUs/node
  ○ large memory (96 GB – 1 TB)
  ○ /scratch/scholar/u/user: large disk storage (100 TB)

Windowed vs Command Line UI

Windowed
• icon indicates type of file
• click folder icon to open
• click file icon to open with app
• click program icon to run
• many folders and files open at once

Command line
• color or suffix may indicate type of file
• view folder with ls
• move to folder with cd
• see file contents with cat
• type filename to run
• only one file open at a time – unix considers you to be “in” a directory

Unix Command line interface
• Essential for high-performance computing
• Only a few basic commands you MUST know
• Many convenient commands and shortcuts you can learn over time
  ○ see list of tutorials in unix manual
• It’s not that hard – follow the advice of the Hitchhiker’s Guide to the Galaxy
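As a minimal sketch of the command-line actions in the comparison above, the session below creates a throwaway directory and applies pwd, ls, cd, and cat to it. The names demo and notes.txt are hypothetical, chosen only for illustration.

```shell
# Work in a throwaway directory so none of your real files are touched
workdir=$(mktemp -d)
cd "$workdir"

# Create a hypothetical folder and a small text file inside it
mkdir demo
echo "hello from the command line" > demo/notes.txt

pwd                 # show which directory you are "in"
ls                  # view the current folder (lists: demo)
cd demo             # move into the folder
ls                  # view its contents (lists: notes.txt)
cat notes.txt       # see the file contents

# Keep the results in variables so they can be inspected afterwards
listing=$(ls)
contents=$(cat notes.txt)
```

Each command acts on the directory you are currently “in”, which is why the second ls shows something different from the first.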

Unix Command Line Interface: Main topics
• Connecting to the server using your personal computer
• Organizing your work with directories (folders)
  ○ Creating, moving, deleting
• Managing files
  ○ Creating, moving, deleting
• Moving files between the server and your personal computer

Unix files: Paths and filenames
• Directories == folders
• In unix you are always considered to be located in a single directory
  ○ when you first log in you are “in” your home directory
• Directories are just files in a folder
  ○ Even on your laptop this is true – think of the file browser used to open files

/home/mgribsko

/home/mgribsko/src

/home/mgribsko/src/local_scripts

Unix files: Paths and filenames
• A filename has two parts
  ○ the name itself, igv_session.xml
  ○ the directory where the file is found, /home/mgribsko
• A file in any directory can be identified by a directory path plus its name
  ○ /home/mgribsko/igv_session.xml
  ○ /home/mgribsko/src/pkg_resources.py
• Directories in the path are separated by a / (slash)
  ○ slash (/) and backslash (\) are different characters

Unix files: Paths and filenames
• Identifying files
  ○ A file in the current directory can be identified by just its name
    – local name
    – for instance, igv_session.xml in my home directory
  ○ A file can be identified by its complete path + filename
    – AKA absolute name
    – /home/mgribsko/igv_session.xml
    – fully qualified names always start with /
  ○ A file can be identified by a relative path + filename
    – relative names never start with /
  ○ The same file, named three ways
    – /home/mgribsko/src/pkg_resources.py   fully qualified
    – src/pkg_resources.py   relative name (from /home/mgribsko)
    – pkg_resources.py   local name (from current directory)
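The three kinds of name can be tried out with a short sketch. It assumes a temporary directory standing in for /home/mgribsko; the file content "example" is made up for the demonstration.

```shell
# Build a small hypothetical tree that mirrors the layout on the slide
top=$(mktemp -d)                 # stands in for /home/mgribsko
mkdir "$top/src"
echo "example" > "$top/src/pkg_resources.py"

# 1. Fully qualified (absolute) name: starts with /, works from anywhere
abs=$(cat "$top/src/pkg_resources.py")

# 2. Relative name: no leading /, interpreted from the current directory
cd "$top"
rel=$(cat src/pkg_resources.py)

# 3. Local name: just the file name, valid only inside the file's own directory
cd src
loc=$(cat pkg_resources.py)

echo "$abs / $rel / $loc"        # all three names reach the same file
```

Only the absolute name keeps working after a cd to some other directory; the relative and local names depend on where you are.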

Unix files: Paths and filenames
• Special files/directories
  ○ relative paths
    – ./ current directory
    – ../ parent directory
    – ../sibling sibling directory
  ○ example
    – ./   /home/mgribsko/src/local   current directory
    – ../   /home/mgribsko/src   parent directory
    – ../sibling   /home/mgribsko/src/bin   sibling directory
• Predefined symbols
  ○ $HOME – your home (login) directory
    – usually /home/username
    – backed up
    – same on all clusters
    – permanent
  ○ $RCAC_SCRATCH – your scratch (working) directory
    – e.g., /scratch/scholar/u/username
    – 100 TB of storage
    – not backed up
    – cluster dependent, scholar/snyder/rice/halstead …
    – not permanent, you must back up to Fortress manually
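A minimal sketch of the special directories and predefined symbols follows. The src/local and src/bin layout is rebuilt in a temporary directory, and because RCAC_SCRATCH is assumed to be defined only on the Purdue clusters, the sketch prints a fallback message on any other machine.

```shell
# Hypothetical layout: base/src/local and base/src/bin
base=$(mktemp -d)
mkdir -p "$base/src/local" "$base/src/bin"

cd "$base/src/local"
here=$(cd ./ && basename "$PWD")         # ./ is the current directory
parent=$(cd ../ && basename "$PWD")      # ../ is the parent directory
sibling=$(cd ../bin && basename "$PWD")  # ../bin is a sibling directory
echo "$here $parent $sibling"

# HOME is set on every unix system
echo "home directory: $HOME"

# RCAC_SCRATCH is assumed to exist only on the clusters; elsewhere print a fallback
echo "scratch directory: ${RCAC_SCRATCH:-not defined on this machine}"
```

The cd commands inside $( ) run in a subshell, so the session itself stays in src/local throughout.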

Connecting to server
• MacOS – your computer comes with a built-in application that allows you to connect to the server
  ○ go to applications
  ○ start the terminal application
  ○ type ssh username@scholar.rcac.purdue.edu (replace username with your actual username)
  ○ provide password when asked
• Windows – you must use a terminal emulator program
  ○ see the unix manual for options: MobaXterm, PuTTY, SecureCRT, and others
  ○ start an SSH session using the emulator of your choice
  ○ provide username and password when asked

Unix commands: Everything is a command
• unix interprets everything you type as a command, even if you just type a file name
  ○ if it is a unix OS command, it is executed
  ○ if the file can be executed, it is (e.g., a shell script)
  ○ otherwise it is an error
• commands often have options introduced by
  ○ - or
  ○ --
• commands may also have command arguments – usually specifying a file to work on

[Figure: screenshot with the labels “file name” and “error message”]
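The rules above can be sketched with a hypothetical file named data.txt: ls is run bare, with an option, and with an option plus an argument, and then the file name alone is typed as if it were a command. Long options introduced by -- are not run here because the ones available vary between systems.

```shell
workdir=$(mktemp -d)
cd "$workdir"
echo "some text" > data.txt      # hypothetical file to work on

ls                # a command with no options or arguments
ls -l             # the same command with an option introduced by -
ls -l data.txt    # an option plus an argument naming the file to work on

# data.txt is neither a unix command nor an executable file, so typing its
# name alone is an error. The shell reports this with a nonzero exit status.
status=0
data.txt 2>/dev/null || status=$?
echo "exit status: $status"
```

The 2>/dev/null only hides the error message so that the exit status is easier to see; leave it off to read the message itself.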

Unix commands: Basic commands
• manipulating directories: create, view, rename, delete
• manipulating files: create, view, rename, delete
• transferring files to/from clusters
• a few useful utility commands

Unix commands: Input/Output
• Most commands produce some kind of output
  ○ output is sent to your standard output device, called STDOUT for short
  ○ STDOUT is normally your terminal screen
  ○ error messages are sent to your standard error device, called STDERR for short
  ○ STDERR is normally your terminal screen (same as STDOUT)
• Many commands require additional inputs such as filenames
  ○ input is read from your standard input device, called STDIN for short
  ○ STDIN is normally your keyboard
• STDIN, STDERR, and STDOUT can all be redirected to refer to files in your directory
  ○ this allows a series of commands to be stored in a file for use/reuse
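To make the difference between STDOUT and STDERR visible, the sketch below sends each stream to its own file. The 2> form for redirecting STDERR is standard shell syntax but is an addition here, not something shown on the slide; the file names are hypothetical.

```shell
workdir=$(mktemp -d)
cd "$workdir"
echo "present" > exists.txt

# ls prints the existing name on STDOUT and complains about the missing
# file on STDERR.  > sends STDOUT to out.txt; 2> sends STDERR to err.txt.
# ls exits with an error here, so "|| true" keeps the session going.
ls exists.txt missing.txt > out.txt 2> err.txt || true

echo "STDOUT captured:"
cat out.txt
echo "STDERR captured:"
cat err.txt
```

Without the redirections both streams would arrive on the terminal screen and be indistinguishable.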

Unix commands: Input/Output
• the < (less than) and > (greater than) symbols are used to redirect input and output (usually to files)
  ○ example: output of ls command redirected to file “a.a”; note that a new file “a.a” is created

    scholar-fe00: ls
    fasta_split.pl fivep_fa lenhist.pl probknot release threep_fa
    scholar-fe00: ls > a.a
    scholar-fe00: ls
    a.a fasta_split.pl fivep_fa lenhist.pl probknot release threep_fa

  ○ example: input for cat command redirected from keyboard to file “a.a”; in the first command each typed line is followed by the computer response

    scholar-fe00: cat
    ldskfj
    ldskfj
    lskjfl;s
    lskjfl;s
    scholar-fe00: cat < a.a
    a.a
    fasta_split.pl
    fivep_fa
    lenhist.pl
    probknot
    release
    threep_fa
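The same two redirections can be reproduced anywhere with a minimal sketch. It assumes a temporary directory holding two hypothetical empty files in place of the real ones in the listing.

```shell
workdir=$(mktemp -d)
cd "$workdir"
touch fasta_split.pl lenhist.pl      # hypothetical empty files to list

ls > a.a          # output of ls goes to the new file a.a instead of the screen
cat < a.a         # cat reads its input from a.a instead of the keyboard

# wc -l counts lines; here it reads the listing through redirected STDIN.
# The $(( )) wrapper only strips the padding some systems print around the number.
count=$(( $(wc -l < a.a) ))
echo "a.a lists $count names"
```

The count includes a.a itself, because the shell creates the output file before ls runs.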

Unix commands: Input/Output
• Piping – the output of one command can also be sent into the input of another command
  ○ At first this doesn’t seem that useful, but it is one of the features that makes unix an excellent OS for HPC
  ○ piping is accomplished by inserting | (vertical bar) between the commands
  ○ here is a useless example – ls src | cat
    1. run the ls command on directory “src”
    2. send the output of ls to the input of cat
    3. run the cat command

    scholar-fe00: ls probknot | cat
    FBtr0307555.fasta
    FBtr0307555.fasta.ct
    FBtr0307555.fasta.ct.dot
    FBtr0307555.fasta.ct.svg
    FBtr0307555.fasta.mexexpect.ct
    FBtr0307555.fasta.mexexpect.ct.dot
    FBtr0307555.fasta.mexexpect.ct.svg
    ...
• Piping is important, but its use will be more obvious when we see real examples – don’t worry too much about it for now
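For a slightly more useful pipe, here is a minimal sketch that combines ls with wc and grep, the two utilities named in the outline. The file names are hypothetical.

```shell
workdir=$(mktemp -d)
cd "$workdir"
touch a.fasta b.fasta c.fasta notes.txt   # hypothetical files

# Count the files: ls writes one name per line into the pipe, wc -l counts the lines
nfiles=$(ls | wc -l)

# Chain two pipes: keep only the names containing "fasta", then count them
nfasta=$(ls | grep fasta | wc -l)

# $(( )) strips the padding some systems print around the numbers
echo "files: $((nfiles)), fasta files: $((nfasta))"
```

No intermediate file is needed: each command starts reading as soon as the previous one starts writing.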

Unix commands: Background jobs
• unix is primarily an interactive OS
  ○ commands are entered on the command line
  ○ until the command is completed, no further commands can be entered
• this is not very convenient if the command you are running takes a long time to run – e.g., transferring a large file or assembling a genome
• for commands that take a long time, unix can be instructed to start the command and immediately return to a command line
  ○ this is called running a command “in the background”
• there are two ways to run a command in the background
  ○ type the command, including any redirection, as normal, followed by “&”, e.g.
    scholar-fe00: cp bigfile.fastq /scratch/scholar/m/mgribsko/big.data.fq &
  ○ type the command and let it start running, then type ctrl-z, followed by bg
    – ctrl-z suspends the current process
    – the bg command restarts the suspended process as a background process
    – if you don’t restart the suspended process with bg (or fg) it will remain suspended indefinitely (not good)
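The “&” form can be tried safely with a minimal sketch in which sleep stands in for a long-running command. The wait command and the $! symbol are standard shell features that the slide does not mention; the ctrl-z route is interactive only and is not shown.

```shell
# sleep stands in for a long-running command such as copying a large file
sleep 2 &
bgpid=$!          # $! holds the process ID of the most recent background job
echo "started background job $bgpid; the command line is available immediately"

jobs              # list the background jobs known to this shell
wait "$bgpid"     # pause here until the background job finishes
bgstatus=$?
echo "background job finished with exit status $bgstatus"
```

In everyday use you would simply keep typing other commands instead of calling wait.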

Unix commands: Symbols
• the unix shell is completely programmable
• you can define your own commands
• you can define symbols that save typing, e.g., $scratch instead of /scratch/scholar/m/mgribsko
• See the section on customizing
• consult one of the tutorials listed in the manual for information on bash shell programming

• A few symbols are predefined and are useful to know
• unix symbols begin with $ and are often in all upper case
  ○ $RCAC_SCRATCH (your directory on the scratch filesystem)
  ○ $HOME (your home/login directory)
• symbols can be used in commands just as if their values were typed
  ○ ls $RCAC_SCRATCH/avocado is the same as ls /scratch/scholar/m/mgribsko/avocado
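Defining a symbol and a command of your own might look like the following sketch. It assumes a temporary directory in place of a real scratch path; the file reads.fastq and the function name lsscratch are hypothetical.

```shell
# Define a symbol of your own.  Note: no $ when assigning, $ when using.
# The directory is a hypothetical stand-in for a long scratch path.
scratch=$(mktemp -d)
mkdir "$scratch/avocado"
touch "$scratch/avocado/reads.fastq"

# The shell substitutes the symbol's value before running the command
listing=$(ls "$scratch/avocado")
echo "$listing"

# A shell function defines a new command; lsscratch is a hypothetical name
lsscratch() {
    ls "$scratch/$1"
}
viafunc=$(lsscratch avocado)
echo "$viafunc"
```

Definitions typed at the command line last only for the current session, which is why they are usually placed in a shell startup file.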
