Linux for Biologists
Total Page:16
File Type:pdf, Size:1020Kb
Linux for Beginners – Part 2 3CPG Workshop Robert Bukowski Computational Biology Service Unit http://cbsu.tc.cornell.edu/lab/doc/Linux_workshop_Part2_Nov_2011.pdf Topics CBSU/3CPG Lab Part 1: (March Nov. 7, 2011) cbsuwrkst2,3,4 (Linux) Reserving time on 3CPG Lab workstations 3 “interactive” Logging in to a Linux workstation machines with nice consoles (also Terminal window and tricks accessible remotely) Linux directory structure cbsum1c1b00n (Linux) Working with files cbsum1c2b00n (Linux) Working with text files 31 “remote” machines Part 2: (today) cbsulm01,cbsulm02 (Linux, 64 and 500 GB Transferring files to/from workstations RAM) Running applications Note: this will only cover the Linux aspect of running applications; the functionality and the biological aspect will be covered in workshop Using BioHPC Lab Software on Nov. 28, 2011. Basics of scripting (shell and Perl) Note: this will not teach you scripting – just get you started. We are planning a series of workshops on Perl in the fall – stay tuned. In the meantime - use multiple resources online (google “Perl tutorial”, for example). Disk usage guidelines: Local vs. network directories (3CPG LAB – specific) cbsuwrkst2 Network directories / /home, /programs, /shared_data ……… |-- home/bukowski (with all subdirectories) |-- programs/ |-- shared_data/ • Physically located on the file `-- workdir/bukowski server • Visible from all workstations Cbsuwrkstfsrv1 cbsuwrkst3 (file server) • Relatively SLOW access – DO NOT / / run any calculations there, avoid ……… ……… |-- home/bukowski |-- /bigdisk/home/bukowski transferring large files there |-- programs/ |-- /bigdisk/programs/ |-- shared_data/ |-- /bigdisk/shared_data/ `-- workdir/bukowski Local directories: /workdir (with all subdirectories), all cbsuwrkst4 other directories / • Physically attached to “its own” ……… |-- home/bukowski workstation |-- programs/ • Not visible from other |-- shared_data/ `-- workdir/bukowski workstations • Fast access – all calculations should be run in /workdir Disk usage guidelines (3CPG lab specific) Your home directory (e.g., /home/bukowski) • Is network-mounted and therefore access to it is slow • Visible from each workstation – no matter which one you log in to • 200 GB quota will be imposed (may change depending on conditions) • Use it to store files which you use frequently (reference genomes, index files) or which are small and hard to replace (scripts and executables) • Never run any disk intensive applications (all Next-Gen tools are disk intensive) with your home directory (or any of its subdirectories) as the “current directory”. Work on /workdir instead. The /workdir directory • Is local to its workstation (located on disks physically attached to the machine’s controller) • Not visible from other workstations • Temporary – the content of /workdir may be erased after you log out. When you log in again, your files may be no longer there • After you log in, create your own subdirectory in /workdir (if not already there) • All the files to be used in processing have to be moved/copied to that subdirectory • Applications have to be started in that subdirectory • Important output files have to be copied back to the home directory or (better yet) out of the machine. Checking disk space How much disk space is taken by my files? du –hs . (displays combined size of all files in the current directory and recursively in all its subdirectories) du –h --max-depth=1 . (as above, but sizes of each subdirectory are also displayed) How much disk space is available? df -h /workdir is a part of it – this space is yours during your reservation Your home directory is a part of it – you share space with other users File transfer between PC or Mac and a lab workstation On Windows PC: install and use your favorite sftp client program, such as • winscp: http://winscp.net/eng/index.php • CoreFTP LE: http://www.coreftp.com/ • FileZilla (client): http://filezilla-project.org/ • … others… • When connecting to Lab workstations from a client, use the sftp protocol. You will be asked for your user name and password (the same you use to log in to the lab workstations). • Transfer text file in text mode, binary files in binary mode (the “default” not always right). • All clients feature • File explorer-like graphical interface to files on both the PC and on the Linux machine • Drag-and-drop functionality On a Mac: file transfer program is fetch (recommended by Cornell CIT) • http://www2.cit.cornell.edu/services/systems_support/filefetch.html#fetchinst • graphical user interface • Drag-and-drop functionality Large files (> 1 GB) should be transferred to your subdirectory under /workdir (e.g., /workdir/bukowski). Avoid storing such files in your home directory. Never process such files in home directory (or any subdirectory of your home) File transfer fixing Windows/Mac – Linux text file conversion problems unix2dos my_file (convert a text file in linux format my_file to Windows/Mac format, i.e., change line endings) dos2unix my_file (convert a text file my_file in Windows/Mac format to Linux format, i.e., change line endings) File transfer between a lab workstation and another Linux machine Suppose we want to transfer a file from cbsuss04.tc.cornell.edu (another Linux machine; substitute “your” Linux machine here) and cbsuwrkst2 lab workstation. Option 1: when logged in to cbsuwrkst2, sftp to cbsuss04 by running the following commands: cd /workdir/bukowski (this is where we want the file to be on cbsuwrkst2) sftp [email protected] (instead of “bukowski”, use your own user name on cbsuss04; you will be asked for password) cd /data/bukowski/reads (on cbsuss04, go to the directory where the file is) get my_read.fastq (transfer, or “get” the file from cbsuss04) quit (exit sftp client and disconnect from cbsuss04 – we are back on cbsuwrkst2) Option 2: when logged in to cbsuss04, sftp to cbsuwrkst2 by running the following commands: cd /data/bukowski/reads (this is where the file is on cbsuss04) sftp [email protected] (instead of “bukowski”, use your own user name on cbsuss04; you will be asked your lab password) cd /workdir/bukowski (on cbsuwrkst2, go to the directory where the file is supposed to be stored) put my_read.fastq (transfer, or “put” the file on cbsuwrkst2) quit (exit sftp client and disconnect from cbsuwrkst2– we are back on cbsuss04) File transfer from web- and ftp sites to lab workstations Option 1: download using a web browser on workstation. While logged in to the workstation, execute the following: • firefox (this will start the Firefox browser on the workstation) • If you are working remotely from a PC and not using VNC, you will need to have Xming running. Note: Firefox browser you just started is running on Linux workstation, your PC is just displaying the browser’s window. May be slow on slow networks… • Navigate to the site you want to download the file from, click on download link. The browser will ask for destination directory (on the workstation !) to put the file in. Select a directory (should be in /workdir if the file is large) and let the browser complete the download. • Close Firefox browser if no longer needed. File transfer from web- and ftp sites to lab workstations Option 2: run wget command on the workstation (if you know the URL of the file) • Example: wget ftp://ftp.ncbi.nih.gov/blast/matrices/BLOSUM100 (will download the file BLOSUM100 from the NCBI FTP site and deposit it in the current directory under the name BLOSUM100) • another Example (the following should be typed on one line): wget -O e_coli_1000_1.fq “http://cbsuapps.tc.cornell.edu/Sequencing/showseqfile.aspx?cntrl=646698859&laneid=487&mode=http&file=e_coli_1000_1.fq” (the command above can be used to download files given by complicated URLs; note the “” marks around the link and the –O option which specifies the name you want to give the downloaded file) More about commands • Each command is, in fact, an executable program stored somewhere on disk, usually in places like /bin, /usr/bin, or /usr/local/bin • which mv (tells us where on disk the command mv is located) • Why can we just use mv rather than the full name /bin/mv ? Because of the search path environment variable which is defined for everybody. The you type mv, the system tries each directory on the search path one by one until it finds the corresponding executable. • echo $PATH (displays the search path) • Note: the current directory ./ is NOT in the search path. If you need to run a program located, say in your home directory, you need to precede it with ./, for example, ./my_program • The next-gen analysis applications installed on the workstations are also in your $PATH. Thus, you can launch them using just the name rather than the full path: • Example: command samtools is equivalent to /programs/bin/samtools/samtools Example project Objective: align Illumina reads to D. Melanogaster genome • Download data from FTP server (using Firefox or wget command) • Extract files: reference genome (FASTA) and Illumina reads (FASTQ) • Index reference genome (i.e., prepare it for use with BWA aligner program) • Align reads using BWA aligner • Convert alignments to SAM format • Convert alignments in SAM format to BAM format Download/unpack project data Create a directory on local scratch disk (/workdir) cd /workdir/bukowski mkdir d_melanodaster cd d_melanogaster Download the tgz archive with sample files wget ftp://cbsuftp.tc.cornell.edu/software/CBSUtools/Linux2workshop/fly_example.tgz Unpack the tgz archive tar -xzvf fly_example.tgz