Linux Command Line Reference for Bioinformatics

DIRECTORIES

CREATE/DELETE DIRECTORIES

mkdir SeqDir
    Create the directory SeqDir.
rmdir SeqDir
    Remove the empty directory SeqDir.
rm -rf SeqDir/
    If the directory is not empty, you can delete the dir and all subdirs and files using the rm command with the r and f options.

NAVIGATION

cd SeqDir
    Change your current working directory to SeqDir. Using cd without any options ('cd') will take you to your home directory.
cd ..
    Change to the parent directory.
cd /home/username/Dir/SubDir
    Change dir using the full directory path.

DIR INFORMATION

pwd
    List the full path of your current directory.
ls
    List the files in the current directory.
ls -alh
    List all files and show file size in a human readable format.
ls -l | wc -l
    Count the number of files in a directory (the count includes the 'total' line that ls -l prints first).

FILES

More information on dealing with large text files is listed under the FASTA FILES heading.

COPY, RENAME & MOVE FILES

cp MySeqs.fasta MyCopy.fasta
    Copy the MySeqs.fasta file to MyCopy.fasta.
cp *.fasta SeqDir/
    Copy all files with the fasta extension to the SeqDir directory.
mv MySeqs.fasta New.fasta
    Rename the MySeqs.fasta file to New.fasta.
mv *.fasta SeqDir/
    Move all files with the fasta extension to the SeqDir directory.

FILE PERMISSIONS

chmod ### MyFile.txt
e.g. chmod 755 MyProgram.pl
    Change the permissions associated with files and directories (e.g. make a PERL program executable). The ### refers to the numeric file permission code.

FILE COMPRESSION

gzip MyFile.txt
    This will gzip the file MyFile.txt.
gunzip MyFile.txt.gz
    This will unzip the file MyFile.txt.
bzip2 MyFile.txt
    Compress the file with bzip2 (better compression than gzip).
bunzip2 MyFile.txt.bz2
    Unzip the bzipped file.

UNCOMPRESS TO DIR

The sequence of commands below allows you to uncompress an entire directory tree to a single directory. This is useful if you have downloaded sequence trace files from genbank and you would like all of the data in a single directory.

tar -C output -zxf archive.tar.gz
    Extract the archive into the output destination folder.
find . -type f -exec mv -i {} . \;
    Run from inside the output folder: move every file in the tree up into the current directory.
find * -type d -prune -exec rm -rf {} \;
    Remove the subdirectories that are left behind.
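The UNCOMPRESS TO DIR steps can be tried end to end on a throwaway archive. The sketch below is a minimal example under stated assumptions, not the card's exact commands: the archive contents and the .scf file names are made up for illustration, -mindepth 2 is added so that files already at the top level are left alone, and the leftover directories are removed with find's -empty -delete (available in GNU and BSD find) instead of rm -rf, so that nothing non-empty is ever deleted.

```shell
set -eu

# Work in a temporary directory so nothing real is touched.
work=$(mktemp -d)

# Build a small hypothetical archive with trace files nested at different depths.
mkdir -p "$work/src/run1/plateA" "$work/src/run2"
echo trace1 > "$work/src/run1/plateA/a.scf"
echo trace2 > "$work/src/run2/b.scf"
tar -C "$work/src" -czf "$work/archive.tar.gz" .

# Extract the archive into an empty output directory.
mkdir "$work/output"
tar -C "$work/output" -xzf "$work/archive.tar.gz"
cd "$work/output"

# Move every regular file found below the top level into the current directory.
# -i makes mv ask before overwriting a file with the same name.
find . -mindepth 2 -type f -exec mv -i {} . \;

# Remove the now-empty subdirectories (deepest first, which -delete implies).
find . -mindepth 1 -type d -empty -delete

# Number of files now sitting directly in the output directory.
flat_count=$(find . -maxdepth 1 -type f | wc -l | tr -d ' ')
echo "files in output: $flat_count"
```

If two files in different subdirectories share a name, mv -i will stop and ask which one to keep, so run this interactively on real data.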

DIR COMPRESSION

tar -cvf SeqDir.tar SeqDir/ or
tar -czvf SeqDir.tar.gz SeqDir/
    Use the tar (Tape Archive) program to archive the directory named SeqDir. Use of the z option will compress the archive. Keep f as the last letter of the option cluster so that it is followed directly by the archive name.
tar -xvf SeqDir.tar
    Use x to extract the tar archive.

FIND FILES

locate MyFile.txt
    The locate command can be used to find the location of files on your hard drive.

GENERAL PROGRAMS

man progname
    The man command displays the manual for Linux command line programs.
nohup
    The nohup command allows you to close your terminal connection to the Linux machine but keep your program running.
clear
    Clear the screen.
passwd
    Set your user password.

SPECIAL CHARACTERS

| Pipe Output
    You can pass output from one program to another program using the pipe character |. Examples are:
    ls | less or head myfile | less
> to File
    The > character can be used to send program output to a text file. Examples:
    ls > File.txt or locate perl > Perl.txt
>> Append to File
    Results are appended to the outfile.
* Wildcard Character
    The asterisk is often used as the wildcard character. It will match a set of characters of any length. Example use: ls *.fasta
? Wildcard Character
    The question mark is a wildcard for a single character.

RESOURCE USAGE

It is important to keep track of your resource usage in multiuser environments. The following commands help you keep track of your storage and processor use on your Linux machine.

DISK USAGE

quota
    See what your disk usage quota is on the current machine. You may have no quota.
df -h
    Look at the amount of disk space used by you and everyone else on the server.
du -h --max-depth=1
    Display your disk use in the current dir. This is a good way to check which files or directories are using up disk space.

PROCESSES

top
    Display the top CPU processes on the local machine. This will show the processes as well as their memory and processor usage.
ps -ef
    Show all processes currently running.
ps -ef | grep username
    Show only your processes. If a process is running that you want to stop, use the 'kill' command.
kill PID
    Kill the process identified by the process id (PID). The PID can be determined using the ps utility. WARNING: 'kill -9' is the nuclear option and will kill the hell out of your runaway process; it may however trash your database, files etc.

USERS

who
    Show who else is logged on.
finger username
    Get information about the user including real name, home dir etc.

FASTA FILES

For a FASTA file named MySeqs.fasta:

FILE OVERVIEW

ls *.fasta
    Show all fasta files in the directory.
grep -c '>' MySeqs.fasta
    Count the number of sequence records.
wc -l MySeqs.fasta
    Count the number of lines in the file.
wc -c MySeqs.fasta
    Count the total number of characters.

VIEW FILE

less MySeqs.fasta
    View the entire fasta file.
head -n 50 MySeqs.fasta or
head -n 300 MySeqs.fasta | less
    Look at the beginning of a fasta file. Use (-n) to select the number of lines. For large n pipe the output to the less utility.
tail -n 50 MySeqs.fasta or
tail -n 300 MySeqs.fasta | less
    Look at the end of the fasta file.

COMBINE AND SPLIT FILES

cat *.fasta > AllSeqs.fasta
    Combine all fasta files in the current directory into a single file.
csplit AllSeqs.fasta '/>/' {*} or
csplit -f PREFIX -n 8 AllSeqs.fasta '/>/' {*}
    Split the fasta file into a separate fasta file for each record. PREFIX stands for the output name prefix of your choice. The following options are available for csplit:
    -f  Prefix in output names
    -n  Num digits long for output names
    -b  Suffix for output names
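The FASTA counting and splitting commands above can be checked on a tiny test file before running them on real data. This is a minimal sketch assuming GNU coreutils csplit; the three records, the rec_ prefix and the suffix format are made up for illustration. It anchors the pattern with ^ so that a '>' inside a description line is not counted as a record, and adds -z so that the empty piece before the first record is not written out.

```shell
set -eu

# Work in a temporary directory.
work=$(mktemp -d)
cd "$work"

# A hypothetical FASTA file with three records.
printf '>seq1 first\nACGT\nACGT\n>seq2 second\nGGCC\n>seq3 third\nTTAA\n' > AllSeqs.fasta

# Count the records: one header line (starting with '>') per record.
n_records=$(grep -c '^>' AllSeqs.fasta)

# Split into one file per record.
#   -s  do not print the size of each piece
#   -z  skip empty output files (the piece before the first '>')
#   -f  prefix for output names
#   -b  printf-style suffix for output names
csplit -s -z -f rec_ -b '%03d.fasta' AllSeqs.fasta '/^>/' '{*}'

# Count the pieces and confirm each one holds exactly one header line.
n_files=$(ls rec_*.fasta | wc -l | tr -d ' ')
headers_per_file=$(for f in rec_*.fasta; do grep -c '^>' "$f"; done | sort -u)

echo "records: $n_records, files: $n_files"
```

If the record count and the file count disagree, the input probably has header lines that do not start at the beginning of a line.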

BASIC PERL

For a PERL program named MyPerl.pl:

MODIFY AND RUN PROGRAMS

emacs MyPerl.pl
    Use the emacs text editor to edit the program.
chmod 755 MyPerl.pl
    Make the program executable by you, other people in your group and anyone else on the server; other people do not have write access to the program.
./MyPerl.pl
    Run the perl program 'MyPerl.pl' in the current directory.

LOOPS

for ( $i=0; $i<=$MaxNum; $i++) {}
    Loop variable $i from zero to MaxNum.

FREQUENTLY USED PERL MODULES

DBI
    Database interface for connection to database servers (MySQL).
Getopt::Std
    Accept command line arguments.
Term::ANSIColor
    Print in color. Useful for drawing attention to error messages, table headers etc. Example:
        use Term::ANSIColor;
        print color 'bold red';
        print "WARNING\n";
        print color 'reset';
Text::Wrap
    I use this for printing of sequence residues that are tabbed over.

EMACS TEXT EDITOR

Emacs is a powerful text editor available on many linux distributions. Emacs makes heavy use of the Ctl, Meta (or ALT) and Shift keys. These are indicated below as C, M and S. To launch emacs from the command line simply type:

emacs MyProgram.pl
    This will open the file MyProgram.pl for editing in emacs. If the file does not already exist, a new file will be created.

C-h             Online help
C-g             Stop current operation.

FILES
C-x C-s         Save the current file.
C-x C-w         Save the file to a new name.
C-x C-c         Exit emacs (you are asked about unsaved files).
C-x d           Open the directory.
C-x i           Insert another file.

EDIT
Backspace       Delete previous character.
C-k             Kill to end of the current line.
C-y             Yank (paste) the last killed text.
C-S-_           Undo.
C-w             Kill (cut) the selected region.

SEARCH
C-s             Search forward
C-r             Search backward

CURSOR MOVEMENT
C-l             Recenter, refresh screen (lc L)
C-a             Move to beginning of current line
C-e             Move to end of the current line
M-f             Move forward one word
M-b             Move backward one word
C-v             Move forward one screen
M-v             Move back one screen
M-S-<           Go to the beginning of the buffer
M-S->           Go to end of the buffer.
M-x goto-line   Goto line number

NCBI BLAST

The NCBI Standalone BLAST program is available for download from NCBI: http://www.ncbi.nih.gov/BLAST/download.shtml.

formatdb -p F -i MySeqs.fasta -t Seq -n Seq
    Format the fasta file named MySeqs.fasta. For more variables available type 'formatdb --help'. The title {-t} and name {-n} of the database will both be set to Seq.
blastall --help
    Display the NCBI BLAST help.
blastall -p program -i infile -d DB -o outfile
    • program is one of:
          Program    Query          Database
          blastn     Nucleotide     Nucleotide
          blastp     Protein        Protein
          blastx     Trans. Nucl.   Protein
          tblastn    Protein        Trans. Nucl.
          tblastx    Trans. Nucl.   Trans. Nucl.
    • infile is a fasta formatted text file.
    • DB is a blast database created using formatdb.
    • outfile is the path of the output file. I like to give the outfile the *.blo extension to represent this as a blast output.
    • A number of other command line options are available for blastall. These include:
          -a  Number of processors to use
          -e  E-value cutoff
          -U  Mask out lowercase letters
          -G  Cost to open a gap
          -E  Cost to extend a gap
          -W  Word size

SFTP

SFTP is a secure file transfer program that comes installed by default with most Linux distributions. It encrypts both your login and the files you transfer, which makes it a safer choice than plain FTP for moving files from the command line.

CONNECTING

sftp ftp.here.edu
    Connect to the ftp server at the address specified. You will be prompted for a valid user name and password.
bye
    Quit the SFTP session. Also: quit
help
    Display SFTP help. Also: ?
!
    Escape to the local shell.
! cmd
    Run command 'cmd' in the local shell.

DIRECTORY NAVIGATION

mkdir MyDir     Create a directory on the ftp server.
lmkdir MyDir    Create a directory on the local machine.
pwd             Display the remote working dir.
lpwd            Display the local working dir.
cd              Change dir on the ftp server.
lcd             Change dir on the local machine.
ls              List files in the server dir.
lls             List files in the local dir.

TRANSFERRING FILES

get myfile.fasta
    Download a file from the server.
get *.fasta
    Download multiple files.
put myfile.fasta
    Upload a file from the local machine to the ftp server.
put *.fasta
    Upload multiple files.

MySQL

MySQL is a great free database. Remember that all MySQL commands must end with ; or \g.

GETTING STARTED

USE dbName;
    Use the selected database.
SELECT * FROM tblName; or
SELECT * FROM tblName\g
    Print all of the records from the table.
SHOW COLUMNS FROM tblName;
    Show the names of the table columns.

CREATE A DATABASE

CREATE DATABASE dbName;
    Create a new database named 'dbName'.
SHOW DATABASES;
    Show all databases in mysql.
USE dbName;
    Use the selected database.

CREATING TABLES

CREATE TABLE tblName (ColOne integer, ColTwo char(10), ColThree integer);
    Creates a table with three columns named ColOne, ColTwo and ColThree.

WORKING WITH TABLES

For a table named tblName:

SELECT COUNT(*) FROM tblName;
    Count the number of records in the table.

BioPERL Reference Card

BIOPERL BLAST PARSING

OVERVIEW

    use Bio::SearchIO;
    my $in = Bio::SearchIO->new( -format => 'blast',
                                 -file   => 'FilePath' );
    while ( my $result = $in->next_result ) {
        while ( my $hit = $result->next_hit ) {
            while ( my $hsp = $hit->next_hsp ) {
                # work with $result, $hit and $hsp here
            }
        }
    }

RESULT

algorithm               The algorithm used (e.g. blastn)
algorithm_version       Algorithm version (e.g. 2.2.12)
query_name              Name of the query sequence
query_accession         Accession number of query sequence
query_length            Length of the query sequence
query_description       Description of query sequence
database_name           Name of the database used for the query
database_letters        Number of residues in the database
database_entries        Number of records in the database
available_statistics    Stats used for the BLAST search
available_parameters    Parameters used for the BLAST search
num_hits                The total number of hits for the query.
hits                    Returns all the hits for the query sequence

HIT

name                    Name of the matching sequence.
length                  Total length of the hit sequence
accession               Accession number of the hit seq.
description             Description of hit seq.
algorithm               Blast algorithm used (e.g. blastn)
raw_score               Raw score of the match.
significance            Significance of the match
bits                    Bit score of the match
num_hsps                Total number of hsps
locus                   Locus name of the hit
accession_number        Accession number
hsps                    Returns all hsps for hit

HSP

algorithm               BLAST algorithm used (e.g. blastn)
evalue                  E value of HSP
frac_identical          Fraction of residues identical.
frac_conserved          Fraction of residues conserved (proteins)
gaps                    Number of gaps in alignment.
query_string            Query sequence from alignment
hit_string              Hit sequence from alignment
homology_string         Homology string from alignment
length('total')         Length of hsp including gaps
length('hit')           Length of aligned hit minus gaps
length('query')         Length of aligned query minus gaps
num_conserved           Number of conserved residues
num_identical           Number of identical residues
rank                    Rank of the HSP
score                   Score
bits                    HSP score in bits
range('query')          Start and end of qry as an array
range('hit')            Start and end of hit as an array
percent_identity        Percent identical in HSP alignment
strand('hit' or 'query')    Strand of the hit or query.
start('query' or 'hit')     Start position of the hit or query
end('query' or 'hit')       End position of the hit or query.

Bio::SearchIO->new OPTIONS

file                    Path to input file
format                  Format of the IO (e.g. blast)
-report_type
-inclusion_threshold
signif                  E value cutoff
score                   Blast score value cutoff
bits                    Bit value cutoff
hit_filter
overlap

More information available at: http://bioperl.org/wiki/HOWTO:SearchIO

BIOPERL SEQ OBJECT

Bio::Seq

OVERVIEW

    use Bio::SeqIO;
    $seq_in = Bio::SeqIO->new( '-format' => 'fasta',
                               '-file'   => "<$infile" );
    $seq_out = Bio::SeqIO->new( '-format' => 'fasta',
                                '-file'   => ">$outfile" );
    while ( my $seqobj = $seq_in->next_seq() ) {
        # do something with $seqobj
    }

SEQUENCE FORMATS

Sequence format can be one of the following:

Format      Description       Object
abi         abi tracefile
ace         ace format        PrimarySeq
chadoxml    chado xml
embl        EMBL              Seq::RichSeq
fasta       fasta format      Seq
fastq       quality
game        game xml
genbank     genbank *.gb      Seq::RichSeq
qual        Phred
scf         Standard chrom
swiss       SwissProt         Seq::RichSeq
strider     DNA Strider
tigr        TIGR XML
tinyseq     NCBI TinySeq
ztr         ZTR Tracefile

Seq::RichSeq is short for Bio::Seq::RichSeq.

More information available at http://bioperl.org/wiki/HOWTO:SeqIO

Information that can be fetched from the BioPERL Seq Object ($ marks a method that returns a scalar):

Method                 Returns    Description
seq()                  $          Sequence string
subseq(i,j)            $          Substring of sequence from position i to j
accession_number()     $          Accession number of the sequence
alphabet()             $          Residues identified as dna, rna or protein
seq_version()          $          Sequence version when available
keywords()             $          Keywords line when available
length()               $          Length of the sequence string
desc()                 $          Description of the sequence
primary_id()           $          Primary id for the sequence
display_id()           $          Display id for the sequence
revcom                 $          Reverse complement of the sequence
translate              $          Translate sequence
species()              Bio::Species                  Species object
annotation()           Bio::Annotation::Reference,
                       Bio::Annotation::Comment      Annotation object
get_SeqFeatures        SeqFeatureI                   Top level sequence features
get_all_SeqFeatures                                  All sequence features (e.g. exons etc.)

Information at: http://doc.bioperl.org/releases/bioperl-current/bioperl-live/Bio/Seq.html

BIOPERL HMMER PARSING

HMMER is a program that uses profile hidden Markov models to identify protein families.
http://hmmer.janelia.org/

OVERVIEW

    use Bio::Tools::HMMER::Results;
    $HmmRes = Bio::Tools::HMMER::Results->new( -type => 'hmmsearch',
                                               -file => $FilePath );
    foreach $seq ( $HmmRes->each_Set ) {
        foreach $domain ( $seq->each_Domain ) {
            # work with $seq and $domain here
        }
    }

-type can be hmmsearch or hmmpfam.

SEQ (usage: e.g. $seq->bits)

accession       Accession number of the qry sequence
bits            The bit score for the set of hits
description     Description of the qry sequence
evalue          The evalue of the set of hits
name            The name of the query sequence

DOMAIN (usage: e.g. $domain->bits)

bits            Bit score of the domain match
evalue          Evalue of the domain match
get_nse         Return the name start end
hmmacc          Accession for -type=>hmmpfam
hmmname         Name of the domain match
seqbit          Bits for the sequence (eq $seq->bits)
seq_id          Name of the sequence (eq $seq->name)
start           Start of the match in the sequence
end             End of the match in the sequence
hstart          Start of the match in the hit sequence
hend            End of the match in the hit sequence

WINDOWS SOFTWARE

The following sources of software for windows are useful for connecting to a Linux box from MS Windows or working with programs and files generated on the Linux side.

Context Text Editor
http://www.context.cx/
    A useful program for programming on the MS windows machine. It can convert between UNIX, Windows, and MAC text file formats.

CygWinX
http://xfree86.cygwin.com/

Putty
http://www.chiark.greenend.org.uk/~sgtatham/putty/
    Open source SSH client for windows.

Unix Utilities For Windows
http://unxutils.sourceforge.net/
    A number of Linux/Unix programs that run in the native windows environment. Programs include gzip, bzip, grep, tar and less. Just put these in the directory C:/Windows/System32 and you will be able to use them from the windows command line.

XwinLogin http://www.calcmaster.net/visual-c++/xwinlogon/

James C. Estill [email protected] Sept 19, 2006