EMBnet Course: Unix & PERL for biomedical researchers

Lausanne, 2 March 2005

Lorenza Bordoli

Swiss Institute of

Outline

• What is EMBOSS? • Major programs • Running EMBOSS Programs from the Unix command line • Combining EMBOSS with Perl Why EMBOSS ?

History: • Wisconsin (sequence analysis) package, GCG (Genetics Computer Group) founded in 1982 as a service of the Department of Genetics at the University of Wisconsin; • Widely used, sources available for inspection (programs could be algorithmically verified and adapted to needs); • Since 1998 EGCG (extended GCG) developed academic add-on, started as a small collection of programs to support EMBL's research activities, in particular the development of automated DNA ; • GCG became a private company in 1990, now belongs to Accelerys; • Currently sources not freely available anymore, no longer possible to distribute academic software source code which uses the GCG libraries! •1999 - EGCG split from GCG to become EMBOSS;

February 2005: version 2.9.0

What is EMBOSS ?

• http://emboss.sourceforge.net/

• EMBOSS, The European Molecular Biology Open Source Software Suite, is a package of high-quality FREE Open Source software for sequence analysis;

• EMBOSS includes hundreds of applications (+150). They share a similar interface, but each comes with its own documentation:

• Many sequence analysis & display programs. • Protein 3D structure prediction being developed. • Other assorted programs, eg: enzyme kinetics. • Extensible (with some C programming knowledge)!

• Complete list of the programs in the currently release: http://emboss.sourceforge.net/apps/#Overview

• Grouped applications: http://emboss.sourceforge.net/apps/groups.html EMBOSS !

• Free Open Source (for most Unix platforms, including MacOSX)

• GCG successor (compatible with GCG file format)

• Public domain (GNU Public License)

• Written by HGMP/Sanger/EBI/Denmark … etc

• Easy to install locally: but no interface, requires local databases Unix command-line only

• Interfaces: Jemboss, www2gcg, w2h, wEMBOSS… (with account) Pise, EMBOSS-GUI, SRS (no account) Staden, Kaptain, CoLiMate, Jemboss (local)

History (cont.):

• The UK Medical Research Council is to close the Research and Bioinformatics Divisions of the RFCGR (the current home of EMBOSS) in the middle of 2005. The MRC Press Office has stated: "All MRC can say at this stage is that Council have made a decision to close the Research and Bioinformatics Divisions. However, the Director has been asked to draw up a closing down plan for consideration by Council in July."

• “This action will more than halve the current core development team and will therefore adversely affect the development and support of EMBOSS. We hope that alternative sources of funding can be found.”

• EMBOSS has moved to SourceForge.net (http://sourceforge.net/); EMBOSS: Introduction

• The EMBOSS package consists of a large number of separate programs that have a specific function.

• They usually take a (number of) input file(s) and some parameters that are important to the function and produce output in the form of files, plots, web pages or simple text output.

Running EMBOSS Programs

Remote server: EMBOSS programs are run by: sib-dea.unil.ch • typing them at the UNIX prompt, • or by using a graphical interface.

Remote server: you personal Local computer: account your PC in the lab, in the course room,… Running EMBOSS Programs

Remote server: EMBOSS programs are run by: sib-dea.unil.ch • typing them at the UNIX prompt: X-terminal (X window system)

Local computer

Running EMBOSS Programs

Remote server: EMBOSS programs are run by: sib-dea.unil.ch • or by using an interface: web based (browser), java based

embl:m93650

Local computer Graphical interfaces to EMBOSS

• Jemboss: java based interface to EMBOSS • wEMBOSS: web interface to EMBOSS • Pise: web interface to EMBOSS • others : http://emboss.sourceforge.net/

Access to EMBOSS: graphical interfaces http://emboss.ch.embnet.org

Access to EMBOSS: graphical interfaces

• Jemboss: account needed on EMBnet machine (->ask Laurent) • Pise: no account needed: anonymous access http://www.bc2.unibas.ch/

Jemboss/wEMBOSS: unibasel account needed

Some major programs:

• General: wossname lists all EMBOSS programs showdb shows the available databases

• Sequence retrieval: seqret retrieves and/or changes format of a sequence seqretset retrieve and or change formats of a number seqretall of sequences at once transeq translate a DNA sequence to protein backtranseq translate a protein sequence to DNA extractseq extract regions from a sequence cutseq remove a region from a sequence pasteseq inserts a sequence into another sequence infoseq display information about a sequence splitter split a sequence into smaller sequences Some major programs (cont.):

• Sequence comparison needle Needleman-Wusch water Smith-Waterman sequence alignment stretcher Myers and Miller global alignment matcher Huang and Miller local alignment dottup dotmatcher dotplot comparisons of two sequences. prettyplot plots multiple sequence alignments polydot supermatcher dotplot comparisons of multiple sequences emma ClustalW program

• Sequence parameters cusp generates a codon usage table syco synonymous codon usage plot dan calculates DNA/RNA melting temperature compseq sequence composition tables

Some major programs (cont.):

• DNA Sequence features remap restriction map of the sequence cpgplot cpgreport CpG island detection etandem einverted finds tandem and inverted repeats plotorf plots potential ORFs showorf pretty display of potential ORFs fuzznuc DNA pattern search tfscan scans sequence for TF binding sites Some major programs (cont.):

• Protein Sequence features ief Isoelectric point calculation antigenic Finds potential antigenic sites digest protein digestion map findkm Vmax and Km calculations fuzzpro protein pattern search garnier protein 2D structure prediction helixturnhelix finds nucleic acid binding motifs octanol pepwindow displays protein hydropathy patmatdb patmatmotifs searching with motifs vs protein sequences pepcoil predicts coiled coil regions pepinfo pepstats Protein information Hammer package ehmmpfam, ehmmsearch, ehmmbuild, … Phylip package efitch, edolpenny, edollop, …

Working with sequences :

• EMBOSS reads sequences from files or databases.

• It automatically recognizes the input sequence format.

• You can easily specify many output formats. Uniform Sequence Address (USA)

• = a standard way of specifying a sequence to be read into a program in EMBOSS • Sequences can be in databases or in files • It has the following syntax:

format::database:entry format::file:entry

In general, a USA specifies • what sequence format to expect • what file or database to open • what entry to look for

Uniform Sequence Address (USA)

format::database:entry

• Of these only the “file”or “database” are necessary; • If format is omitted: EMBOSS will check and recognizes the format (occasionally needs a hint) * ; • if the “entry “ part is omitted, all of the entries in the file or database are read in;

* EMBOSS recognizes: GCG, FASTA, ClustalW, MSF, EMBL, GenBank, DNAStrider, Phylip, PIR, PAUP,ASN.1, NBRF, Fitch, IntelliGenetics Uniform Sequence Address (USA)

The most common ways of specifying a sequence are: • to type the name of the file that the sequence is in: myfile.seq • or type “db:entry”, where “db” is the name of the database and “entry” is either the ID or the accession number (AC) of the sequence in the database

Ex.: database:accession embl:X65923 database:ID swissprot:opsd_xenla file name myfile.seq

ACs and IDs …

• An entry in a database: uniquely identified in that database. Most sequence databases have two such identifiers for each sequence - an ID name and an Accession number.

• Why are there two such identifiers? •The ID name: a human-readable name that had some indication of the function of its sequence: OPSD_HUMAN in Swiss-Prot !! ID names are not guaranteed to remain the same between different versions of a database.

• Accession numbers: unique alphanumeric identifiers that are guaranteed to remain with that sequence through the rest of the life of the database: P08100. If two sequences are merged into one, then the new sequence will get a new Accession number and the Accession numbers of the merged sequences will be retained as 'secondary' Accession numbers. Databases

You can easily find out what are the database name in your EMBOSS installation by running the showdb program:

Unix % showdb Displays information on the currently available databases #Name Type ID Qry All Comment #======sw P OK OK OK Swiss-Prot section of UniProt swiss P OK OK OK Swiss-Prot section of UniProt swiss-prot P OK OK OK Swiss-Prot section of UniProt trembl P OK OK OK TrEMBL section of UniProt P OK OK OK UniProt (Swiss-Prot & TrEMBL), …

Databases

#Name Type ID Qry All Comment #======sw P OK OK OK Swiss-Prot section of UniProt

•ID allows programs to extract a single explicitly named entry from the database: embl:x13776 ;

• Query indicates that programs can extract a set of matching wildcard entry names: sw:opsd_* ;

• All allows programs to analyze all entries in the database sequentially: embl:* ; Uniform Sequence Address (USA)

• you may also use: filename all sequences in a file filename:entry an entry in a file dbname all sequences in a database (not recommended) dbname:entry a sequence in a database @listfile a list file list::listfile a list file asis::sequence a specific short sequence

Specifying a List File

• Instead of containing the sequences themselves, a listefile contains “references” to sequences using any valid USA. • Example of a ListFile: opsd_abyko.fasta : the name of a sequence file; sw:opsd_xenla : a specific sequence in the Swiss-Prot database; @anotherlist : the name of a second list file;

• Blank lines and lines starting with a '#' character are ignored in List Files: a way to add your comments: this won’t be read by the programs. Specifying a List File

• Instead of containing the sequences themselves,#This a listefileis an examplecontains “references” to sequences using any valid USA.#of a list file • Example of a ListFile: opsd_abyko.fasta : the name of a sequenceopsd_abyko.fasta file; sw:opsd_xenla : a specific sequencesw:opsd_xenla in the Swiss-Prot database; @anotherlist : the name of a second@anotherlist list file;

• Blank lines and lines starting with a '#' character are ignored in List Files: a way to add your comments: this won’t be read by the programs.

Specifying a sequence “As Is”

• The simplest USA format is “asis” format. This is used to specify a sequence immediately without it having to be in a file or a database. • The syntax is “asis::sequence”: asis::atgctagcttagc : for the sequence “atgctagcttagc” ; Sequence Formats

• Sequences can be read and written in a variety of formats;

• Sequences are stored in databases or in files as simple text (ASCII text);

• Microsoft WORD format is not a sequence format (save the file as text *.txt file!!!)

• The default sequence file format is fasta: >xyz some other comment ttcctctttctcgactccatcttcgcggtagctgggaccgccgttcagtcgccaatatgc agctctttgtccgcgcccaggagctacacaccttcgaggtgaccggccaggaaacggtcg cccagatcaaggctcatgtagcctcactggagggcatt xyz: ID name

Format: EMBL, GenBank, SwissProt, PIR;

• Sequence File: Files can hold sequences in standard recognized formats;

Sequence Formats

• Currently input/output supported formats (more than 42): http://emboss.sourceforge.net/docs/themes/SequenceFormats.html

• Input Sequence Formats: fasta, EMBL (embl/em), Swiss-Prot (swissprot/swiss/sw), GCG (gcg), MSF (msf), Genbank (), raw,…

• Output Sequence Formats: embl, gcg, swiss, CLUSTALW (, aln), genbank, … Running EMBOSS from the Unix command line

The command Line

• EMBOSS programs are called by giving their name on the UNIX command line either: • without parameters: interactive way, query-answer session with the user, in which the user is asked (prompted) to enter a piece of information one at a time; • or with parameters: the program will have all the information it needs all at once; The command Line

• EMBOSS programs are called by giving their name on the UNIX command line:

• without parameters: interactive way, query-answer session with the user, in which the user is asked (prompted) to enter a piece of information one at a time:

% seqret Reads and writes (returns) a sequence Input sequence: embl:xlrhodop Output sequence [xlrhodop.fasta]:

% more xlrhodop.fasta >XLRHODOP L07770 Xenopus laevis rhodopsin ggtagaacagcttcagttgggatcacaggcttctagggatccttt gggcaaaaaagaaac acagaaggcattctttctatacaagaaaggactttatagagctgc taccatgaacggaac

The command Line

• EMBOSS programs are called by giving their name on the UNIX command line: • without parameters: interactive way, query-answer session with the user, in which the user is asked (prompted) to enter a piece of information one at a time: % wossname Finds programs by keywords in their one-line documentation Keyword to search for: restrict SEARCH FOR 'RESTRICT’ recode Remove restriction sites but maintain the same translation remap Display a sequence with restriction cut sites, translation Etc… The file “stdout”

• Note that the default output file for wossname is: stdout (Standard output)

• Use this whenever prompted for an output file.

• This is a ‘magic’ file name.

• It displays the output on the screen, not a file.

The command Line

• EMBOSS programs are called by giving their name on the UNIX command line:

• with parameters: the program will have all the information it needs all at once:

• % seqret -sequence filename.seq -outseq xlrhodop.fasta

• programname • parameter/object: here: a sequence file • qualifier: specify the properties of that object/parameter. here: input/output sequence The command Line and parameters

% seqret -sequence filename.seq -outseq xlrhodop.fasta

programname parameter/object: here: a sequence file qualifier: specify the properties of that object/parameter. here: input/output sequence

• There are 3 classes of parameters: • standard (mandatory) : minimum input for the program, the program will prompt for them; ex: sequence file name; • additional (optional) : you will be prompted only if you use the –options (-opt) qualifier; ex: begin and end of the sequence; • advanced : you have to specify them on the command line (you won’t be prompted);

The command Line and parameters

• There are 3 classes of parameters: • standard (mandatory) : minimum input for the program, the program will prompt for them; ex: sequence file name; • additional (optional) : you will be prompted only if you use the –options qualifier; ex: begin and end of the sequence; • advanced : you have to specify them on the command line (you won’ be prompted);

• How do I find them out? With the –help qualifier! (try also –verbose) % Programname -help Qualifiers

% wossname –help (-verbose) Mandatory qualifiers: [-search] string Enter a word or words here.

Optional qualifiers: -outfile outfile Output file name

Advanced qualifiers: -[no]emboss bool EMBOSS program documentation will be searched.

"-sequence" related qualifiers

• -sbegin integer first base used • -send integer last base used, def=seq length • -sreverse bool reverse (if DNA) • -sask bool ask for begin/end/reverse • -slower bool make lower case • -supper bool make upper case • -sformat string input sequence format "-outseq" related qualifiers

• -osformat string output sequence format • -ossingle bool separate file for each entry

•…

Boolean options (Yes/No, True/False)

• -thing, -nothing • -thing=T, -thing=F • -thing=1, -thing=0 • -thing=Y, -thing=N General qualifiers

• These can be used with any program:

-help Prints a summary of the options the program can take. With -verbose it gives a more detailed list. -options Prompt the user for the optional parameters -auto Accept all the default settings and run without prompting the user (used when running program scripts) -sask Ask for the start, end and reverse of the sequence input -stdout Print output to stdout (the screen) instead of to a file. -filter Take input from stdin (keyboard) and output to stdout -warnings Report warnings -error Report errors

Example: Seqret

% seqret -help -verbose

• Give seqret all of its data on the command-line :

% seqret embl:xlrhodop -outseq xlrhodop.fasta

• Even shorter, leave out the qualifier:

% seqret embl:xlrhodop xlrhodop.fasta Example: Seqret

• seqret can reformat sequences by specifying the output format: % seqret embl:xlrhodop –osformat gcg -outseq xlrhodop.gcg

• equivalent to:

% seqret embl:xlrhodop xlrhodop.gcg –osformat gcg

• equivalent to: % seqret embl:xlrhodop gcg::xlrhodop.gcg

Unix % more xlrhodop.gcg !!NA_SEQUENCE 1.0 Xenopus laevis rhodopsin mRNA, complete cds. XLRHODOP Length: 1684 Type: N Check: 9453 .. 1 ggtagaacag cttcagttgg gatcacaggc ttctagggat cctttgggca 51 aaaaagaaac acagaaggca ttctttctat acaagaaagg actttataga

Practical:

• Try running wossname: Can you find a program to: Display multiple alignments. Find ORFs (Open Reading Frames). Translate a sequence. Find restriction enzyme sites Find the isoelectric point of a protein. Do global alignments. Uniform Sequence Address (USA)

• = a standard way of specifying a sequence to be read into a program in EMBOSS • Sequences can be in databases or in files • It has the following syntax:

format::database:entry

In general, a USA specifies • what sequence format to expect • what file or database to open • what entry to look for

Databases

#Name Type ID Qry All Comment #======sw P OK OK OK Swiss-Prot section of UniProt

•ID allows programs to extract a single explicitly named entry from the database: embl:x13776 ;

• Query indicates that programs can extract a set of matching wildcard entry names: sw:opsd_* ;

• All allows programs to analyze all entries in the database sequentially: embl:* ; Asterisk on the command line

•You can't use a ‘*’ on the UNIX command-line.

• UNIX tries to match it to filenames.

• Use it quoted, either with quotes or a backslash: "embl:*" embl:\*

•For example: % seqret “sw:*_HUMAN” Human.seq

The full USA syntax filename : a file containing one or more sequences filename:entry : a given sequence in the file. The ‘entry’ mysequences:opsd_xenla is the ID or AC of the sequence in that file filename:entry[start:end] : a part of the sequence can be specified mysequences:opsd_xenla[1:20]* by the range mysequences:opsd_xenla[-1:-20]* : the last 20 residues/nucleotides mysequences:[1:20:r]* : reverse-complemented (nucleotide sequences)

* In some Unix shell you might have to use backslash: \[ and \[ Specifying Search Fields

• Beside ID names or AC numbers there are other ways to specify sequences. • A typical sequence entry in EMBL format is:

ID HSFAU standard; DNA; UNC; 518 BP. AC X65923; SV X65923.1 DE H.sapiens fau mRNA KW fau gene. OS Homo sapiens (human) OC Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; OC Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo. SQ Sequence 518 BP; 125 A; 139 C; 148 G; 106 T; 0 other;

• It is also useful to find sequences that contain words occurring in their description filed (“DE” line), their Keyword field (“KE” line), …

Specifying Search Fields

• You can specify which field you are searching by using one of the following Search Field Names: Name Searches for acc Accession number des Description id ID name key Keyword org Organism Name sv Sequence Version/GI Number

• Examples: embl-des:fau : database embl-des:h*emoglobin : database myclones.seq:des:fau : file The full USA syntax: summary

• The full syntax of the possible USAs are: Mandatory parts of the USAs are given in bold text. asis :: Sequence [start : end : reverse] format :: '@' ListFile [start : end : reverse] format :: 'list' : ListFile [start : end : reverse] format :: Database : Entry [start : end : reverse] format :: Database - SearchField : Word [start : end : reverse] format :: File : Entry [start : end : reverse] format :: File : SearchField : Word [start : end : reverse]

Sequence Formats

• Currently input/output supported formats (more than 42): http://emboss.sourceforge.net/docs/themes/SequenceFormats.html

• Input Sequence Formats: fasta, EMBL (embl/em), Swiss-Prot (swissprot/swiss/sw), GCG (gcg), MSF (msf), raw,…

• Output Sequence Formats: embl, gcg, swiss, CLUSTALW (clustal, aln), … Multiple sequences, single file

• EMBOSS writes many sequences to a single file. Most sequence formats can deal with this: Fasta, EMBL, PIR, MSF, Clustal, Phylip, etc. BUT NOT: Plain, Staden and GCG

• EMBOSS reads many sequences from a single file. Use filename:entryname if you wish to specify a single sequence. If there is only one sequence, or you wish to read all entries, use just the filename.

Multiple sequences, many files

• The command-line qualifier “-ossingle” my be useful – it allows you to write out several sequences, but it writes out each sequence to a separate file;

• The name of the file is constructed from the ID name of the sequence and the extension of the file is the format: Ex. : The sequence with the ID name “IXI_567” in would be written to the file “IXI_567.fasta”

% seqret “embl:hsf*” –ossingle

• The program seqretsplit will split an existing multiple sequence file into many files. Alignment output Formats

• Several formats have been written or adopted for EMBOSS output: http://emboss.sourceforge.net/docs/themes/AlignFormats.html • Multiple sequence alignment: fasta, msf,.. • Pair-wise alignment: pair, score,…

• Each program that writes an alignment has a default alignment format defined for that program. However you can change the output formats with the “ –aformat” qualifier: % water –aformat msf

Feature Formats

• A feature is a region of interest in a specified nucleic or protein sequence. It has a specified start and end position. It has a name describing what type of thing it is: Ex: Swiss-Prot Feature table

FT DISULFID 3 40 FT DISULFID 4 32 FT DISULFID 16 26 FT VARIANT 22 22 P -> S (IN ISOFORM SI). FT VARIANT 25 25 L -> I (IN ISOFORM SI).

• When reading or writing features associated with a sequence, there are a standard set of formats that are used: UFO (Universal Feature Object) e.g. Swiss- Prot (swissprot), EMBL (embl), PIR (pir),… http://emboss.sourceforge.net/docs/themes/FeatureFormats.html

• showfeat useful for displaying features. • etractfeat useful for extracting the sequences of features. Feature Formats

• Many programs will read in and use the feature table of an input sequence. Amongst these are diffseq, extractfeat, maskfeat, seqret, showfeat

• The feature table can be already a part of the sequence (which is generally the case when you are reading the sequence from a database) or located in a separate file

• Try the difference between:

% seqret sw:P15423 -osformat swiss –outseq stdout

% seqret sw:P15423 -osformat swiss –outseq stdout -feature

Report Formats

• There are many ways in which the results of an analysis can be reported: http://emboss.sourceforge.net/docs/themes/ReportFormats.html

######################################## # Program: garnier # Rundate: Mon Feb 11 15:14:40 2002 # Report_file: report.dbmotif ########################################

#======# # Sequence: 100K_RAT from: 1 to: 889 # HitCount: 206 # # DCH = 0, DCS = 0 # # Please cite: # Garnier, Osguthorpe and Robson (1978) J. Mol. Biol. 120:97-120 #------# # Residue totals: H:364 E:149 T:191 C:185 # percent: H: 41.7 E: 17.1 T: 21.9 C: 21.2 # # #------Report Formats

• Many EMBOSS programs are now able to output their results in a standard report format - you can change the report format used by putting '-rformat name' on the command-line, where 'name' is the name of one of the standard report formats.

• examples: embl Writes a report in EMBL feature table format pir Writes a report in PIR feature table format swiss Writes a report in SwissProt feature table format excel This is a TAB-delimited table format suitable for reading into spread-sheet programs such as Excel. seqtable A simple table format that includes the feature sequence

Start End [tagnames] Sequence [start] [end] [tagvalues] [sequence]

Graphic Formats

• -graph

• output as X11 (default), PNG, ps, tektronics amongst other

• CAVEAT: X11 default output doesn’t work with the X-terminal from a PC or Mac, use “ps” instead. ps files can be opened with the Unix program gs. Or save the output as PNG. PNG files can be opened from PC or Mac.

• Ex. % plotorf embl:xlrhodop

• Plot potential open reading frames Example: plotorf

• Ex. % plotorf embl:xlrhodop –graph PNG

Plot potential open reading frames

Graphic Formats

[X] x11 Output to an X-window postscript Output to a postscript file (good for printing on a laser printer) cps Output to a color postscript file text Output to a text file data Output XY data points to a file. (good for importing into a graphing package) [P] png Output to a PNG image file (good for web pages) [X] Tek Output to tektronics terminal [X] xterm Output to an Xterm window

[X]- requires X-windows [P] – requires PNG support

The default filename is prog.format eg. octanol.ps The file “stdout”

• There is a “magic” filename that you can give whenever an output filename is request: “stdout”;

• If you enter this name, then the resulting sequence will not go to a file called “stdout”, BUT it will be printed on the screen;

• Useful when you wish to know the results immediately, or are testing various way of running a program and wish to quickly see the results

•You can specify the output to the screen by “format::stdout”: gcg::stdout

Reminder - help

• If in doubt, use: wossname program –help -verbose program -opt tfm program The friendly manual ;) What EMBOSS does NOT

• The major deficiencies in the EMBOSS package are: BLAST, FASTA, ASSEMBLY (GelMerge, GelEnter,…)* , PAUP, sequence editor

• Don’t worry those programs (and much more!) are also installed on the machines running EMBOSS in Lausanne & Basel (http://www.bc2.unibas.ch/BC2/programs/bc2soft)

• BLAST package (type blastall command) • FASTA package (fasta34,…) • PAUP (paup) • * PHRED, PHRAP, Consed • sequence editor: pico, emacs, vi

What EMBOSS does NOT

• The major deficiencies in the EMBOSS package are: BLAST, FASTA, ASSEMBLY (GelMerge, GelEnter,…) , PAUP, sequence editor

• Graphical Interface:

• BLAST: • SIB: http://www.expasy.org/tools/blast/ • Swiss EMBnet: http://www.ch.embnet.org/software/BottomBLAST.html? • NCBI: http://www.ncbi.nlm.nih.gov/BLAST/ • FASTA: EBI: http://www.ebi.ac.uk/fasta33/ • ClustalW: •Swiss EMBnet: http://www.ch.embnet.org/software/ClustalW.html • PAUP no graphical interface, use Phylip instead (part of EMBOSS) Send the output of a program as input for a second program

% seqret embl:xlrhodop\[110:1174\] -stdout -auto | transeq –filter

-stdout Print output to stdout (the screen) instead of to a file. -filter Take input from stdin (keyboard) and output to stdout -auto Accept all the default settings and run without prompting the user

Combining EMBOSS with Perl

input program1 output program2 Combining EMBOSS with Perl

program1 Parsed program2 input output output/input

Combining EMBOSS with Perl

program1 Parsed program2 input output output/input

embedded in a single Perl script Combining EMBOSS with Perl

How to run an external programs from a Perl script?

1) System function: launches a child process which run a program: system (“seqret embl:xlrhodop –outseq xlrhodop.fasta –auto”); ⇒ Perl waits for it to finish, then possibly grabs its output.

2) Processes as Filehandles: launches a child process that stays alive, communicating via pipes to Perl until the task is complete: open (SEQRET, “ seqret embl:xlrhodop –auto |” ); my @sequence = ; … ⇒ These filehandles “contains” the output of the launched command.

Combining EMBOSS with Perl: an example: cirdna Start 1001 End 4270 group label Block 1011 1362 3 ex1 endlabel label Tick 1610 8 EcoR1 endlabel label Block 1647 1815 1 endlabel label Tick 2459 8 BamH1 endlabel label Block 4139 4258 3 ex2 endlabel endgroup group label Range 2541 2812 [ ] 5 Alu endlabel label Range 3322 3497 > < 5 MER13 endlabel Draws circular maps of DNA constructs endgroup restrict

######################################## # Program: restrict # Rundate: Sat Feb 26 21:53:07 2005 # Report_format: table # Report_file: ab014641.restrict ########################################

#======# # Sequence: AB014641 from: 1 to: 3417 Cloning vector pGEX-PUC-3T # HitCount: 9 # # Minimum cuts per enzyme: 1 # Maximum cuts per enzyme: 1 # Minimum length of recognition site: 4 # Blunt ends allowed # Sticky ends allowed # DNA is circular # Ambiguities allowed Finds restriction enzyme cleavage sites # #======

Start End Enzyme_name Restriction_site 5prime 3prime 5primerev 3primerev 1 8 NotI GCGGCCGC 2 6 . . 556 561 PvuI CGATCG 559 557 . . 856 861 Eco31I GGTCTC 862 866 . . 1335 1331 BcefI ACGGC 1317 1318 . . 1998 2003 XhoI CTCGAG 1998 2002 . . 2170 2175 HindII GTYRAC 2172 2172 . . 2680 2685 BclI TGATCA 2680 2684 . . 2918 2923 BamHI GGATCC 2918 2922 . . 3412 3417 HindIII AAGCTT 3412 3416 . .

#------#------

restrict parsed output

Start 1 End 3417

group label Tick 1 8 NotI endlabel label Tick 556 8 PvuI endlabel label Tick 856 8 Eco31I Endlabel (...) label Tick 2918 8 BamHI endlabel label Tick 3412 8 HindIII endlabel endgroup cirdna.pl

#!/usr/local/bin/perl

#perl script to 1. parse the output of the EMBOSS program "restrict" which uses #the REBASE database of restriction enzymes to predict cut sites in a DNA sequence. #2. write a input file for the EMBOSS program "cirdna" and 3. run the "cirdna" #program which draws circular maps od DNA construct

use strict; use warnings;

#if no filename is specified on the command-line print USAGE and exit #the file should contain a DNA sequence in one of the EMBOSS recognized formats

my $USAGE=" SYNOPSIS $0 filename\n"; unless (@ARGV){ print $USAGE ; exit; }

#Read in the filename from the argument on the command line my $filename=$ARGV[0];

cirdna.pl

#write the parsed output of the "restrict program" in a file my $inputfile = "$filename.restrict"; unless (open(OUT, ">$inputfile")){ print " Could not open file $inputfile! to write to !! \n\n"; exit; } cirdna.pl

#launch the emboss program "restrict" to do a restriction analysis of #the DNA #the maximum number of cuts for any restriction enzyme that will be #considered is set to 1 #a list of enzymes is provided and we specify that the DNA is circular #the command is: #"restrict filename -enzymes HindIII,XhoI -max=1 -plasmid=Y -auto #-stdout" open (RESTRICT, "restrict $filename -enzymes NotI,PvuI -max=1 -plasmid=Y -auto -stdout |");

cirdna.pl

#Read the output in a "while" loop and extract the start and the #beginning of the DNA sequence #and the start, the end and the name of the restriction enzyme my $line; while ($line = ) { chomp($line); #regular expression to match start and the end of the sequence if ($line =~ /^#\sSequence.*from:\s+(\d+).*to:\s+(\d+)/){ my ($seqstart, $seqend) = ($1, $2);

#print the DNA start and the beginning in the input file print OUT "Start\t$seqstart\nEnd\t$seqend\n\ngroup\n"; } #regular expression to match start, end and name of the enzyme if ($line =~ /^\s+([0-9]+)\s+([0-9]+)\s+(\w+)\s+/){ my ($start,$end,$enzyme) = ($1,$2,$3); print OUT "label\nTick\t$start\t8\n$enzyme\nendlabel\n"; } } print OUT "endgroup"; close (OUT); cirdna.pl

#launch the emboss program "cirdna" to draw a circular maps od DNA #the command is: "cirdna inputfile –goutfile outputfile -graph ps -auto" my $cmd="cirdna $inputfile -goutfile $filename -graph ps -auto"; system ("$cmd"); exit;

restrict parsed output

Start 1 End 3417

group label Tick 1 8 NotI endlabel label Tick 556 8 PvuI endlabel label Tick 856 8 Eco31I Endlabel (...) label Tick 2918 8 BamHI endlabel label Tick 3412 8 pGEX-PUC-3T.restrict HindIII endlabel endgroup cirdna.pl output

pGEX-PUC-3T.ps output file

lindna Reminder if you are lost

• If in doubt, use: wossname program –help -verbose program -opt tfm program The friendly manual ;)

References

• http://emboss.sourceforge.net/ • UK HGMP Resource Centre, Userguide, 2002 • www.perl.org