EMBOSS: Handle and Analyze Sequences with Command Lines
http://emboss.sourceforge.net/

EMBOSS is "The European Molecular Biology Open Software Suite", a free equivalent to GCG.

EMBOSS is a free Open Source software analysis package specially developed for the needs of the molecular biology (e.g. EMBnet) user community.

The software automatically copes with data in a variety of formats and even allows transparent retrieval of sequence data from the web.

EMBOSS also integrates a range of currently available packages and tools for sequence analysis into a seamless whole. EMBOSS breaks the historical trend towards commercial software packages.

Within EMBOSS you will find around 150 programs (applications). They share a UNIX command-line interface:
• Type the name of the program
• Add any options you want to the command line
• Press RETURN
• Any mandatory information that was not on the command line will be prompted for.

http://emboss.sourceforge.net/apps/groups.html

EMBOSS applications

Within EMBOSS you will find over 150 programs:
• Sequence alignments
• Rapid database searching with sequence patterns
• Protein motif identification, including domain analysis
• EST analysis
• Nucleotide sequence pattern analysis, for example to identify CpG islands
• Simple and species-specific repeat identification
• Codon usage analysis for small genomes
• Rapid identification of sequence patterns in large-scale sequence sets
• Presentation tools for publication
• And much more!

Output: text or image files.

wossname: a first EMBOSS application

All EMBOSS programs run from the Unix command line. We'll introduce the basics with a specific example: the EMBOSS utility wossname, which produces a list of all the various EMBOSS applications.

wossname Finding EMBOSS programs

    wossname -help
    wossname -help -verbose

Exercise: find the alignment programs in EMBOSS.

http://emboss.sourceforge.net/apps/release/6.4/emboss/apps/seqret.html

Help

    program -help -verbose

Working with sequences

Change format: seqret (equivalent of readseq)

seqret Reads and writes (returns) a sequence

Two ways to use EMBOSS programs in the Terminal:

1) Use the prompt for basic use. Type:

    seqret

2) Use the full command for advanced use with options. Type:

    seqret -h (-verbose)

Let's start

Go to the EMBOSS-exercise directory:

    cd <path to the directory>
    cd EMBOSS-exercise
    ls -l

http://emboss.sourceforge.net/docs/themes/SequenceFormats.html

Change format (reformatting): seqret (equivalent of readseq)

Let's start:

    seqret -sequence dot1.fasta -outseq dot1.out -osformat2 embl
    seqret -sequence 454.fna -outseq 454.embl -osformat2 embl

seqret can write your sequence in 39 output formats.

Information about sequences

infoseq Display basic information about sequences

    infoseq dot1.fasta
    infoseq -only -length -nohead -sequence dot1.fasta    (length only)

seqcount Count the number of sequences (= grep -c '>')
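As a rough illustration of what these counts amount to, the sketch below uses only grep and awk on a toy FASTA file. The file contents and the variable names (fasta, n_seqs, lengths, total) are invented for this example; infoseq and seqcount report more than this and read many more formats.

```shell
# Toy multi-FASTA file (made-up content, for illustration only).
fasta=$(mktemp)
cat > "$fasta" <<'EOF'
>seq1
ACGTAC
GT
>seq2
TTGA
EOF

# Number of sequences: one header line (starting with '>') per record.
n_seqs=$(grep -c '^>' "$fasta")

# Length of each sequence: add up the residue lines that follow each header.
lengths=$(awk '
  /^>/ { if (id != "") print id, len; id = substr($1, 2); len = 0; next }
       { len += length($0) }
  END  { if (id != "") print id, len }' "$fasta")

# Total number of bases across all records.
total=$(awk '!/^>/ { sum += length($0) } END { print sum }' "$fasta")

echo "sequences: $n_seqs"
echo "$lengths"
echo "total bases: $total"
rm -f "$fasta"
```

Anchoring the pattern with ^ avoids counting a '>' that appears inside a description line.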

Exercise:
• Calculate the number of sequences in the 454.fna file.
• Calculate the sum of bases of the 454.fna file.

Exercise:
• Calculate the number of sequences in the multipleseq.fasta file.
• Calculate the sum of bases of the multipleseq.fasta file.

    grep -c '>' multipleseq.fasta
    infoseq -nohead -only -length multipleseq.fasta | awk '{sum += $1} END {print sum}'

Edit sequences

sizeseq Sort sequences by size
seqretsplit Reads sequences and writes them to individual files

Exercise:
• Find the longest and shortest sequences in 454.fna.
• Write the sequences from multipleseq2.fasta into individual files. Don't do it with the 454.fna file!

    sizeseq 454.fna
    infoseq -nohead 454.fna.sizeseq | head -n1    (shortest, with the default ascending order)
    infoseq -nohead 454.fna.sizeseq | tail -n1    (longest)

cutseq Removes a section from a sequence
extractseq Extract regions from a sequence
maskseq Write a sequence with masked regions
revseq Reverse and complement a nucleotide sequence
makenucseq, makeprotseq Create random sequences
splitter Split sequence into smaller sequences

Exercise:
• Extract a part of the BAC.embl sequence, reverse it, and put the nucleotides in lowercase.
• From the BAC.embl file, use splitter to get sequences of 92 bp with an overlap of 91 bp.


Solutions:

    extractseq -sequence BAC.embl -regions '1-1000' -outseq BAC.part -sreverse1 -slower1
    splitter -sequence BAC.embl -size 92 -overlap 91
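To make these options more concrete, the effect of the two commands can be mimicked on a toy sequence with plain awk and tr. This is only a sketch: the sequence, the window size of 4 and the overlap of 3 are invented, and it keeps only full-length windows, which may differ from how splitter treats the end of a sequence.

```shell
seq="ATGGCCTTA"   # toy sequence (an assumption, not taken from BAC.embl)

# Reverse, complement, then lowercase: roughly what -sreverse1 -slower1 request.
revcomp_lower=$(printf '%s\n' "$seq" |
  awk '{ out = ""; for (i = length($0); i >= 1; i--) out = out substr($0, i, 1); print out }' |
  tr 'ACGT' 'TGCA' | tr '[:upper:]' '[:lower:]')
echo "$revcomp_lower"

# Sliding windows: each window starts (size - overlap) bases after the previous one.
windows=$(printf '%s\n' "$seq" |
  awk -v size=4 -v overlap=3 '{
    step = size - overlap
    for (i = 1; i + size - 1 <= length($0); i += step) print substr($0, i, size)
  }')
echo "$windows"
```

With a size of 92 and an overlap of 91 the step is a single base, so every position that can start a full 92 bp window produces one output sequence.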

extractalign Extract regions from an alignment
skipredundant Remove redundant sequences
trimest Remove poly-A tails
trimseq Remove unwanted characters
entret Retrieve a sequence entry
seqret Retrieve a sequence entry

Exercise: from the multipleseq.fasta file, retrieve the AFP0AAA1YB10RM1 sequence and write it to a file in EMBL format.


    seqret -sequence multipleseq.fasta:AFP0AAA1YB10RM1 -osformat2 embl

remap Display restriction enzyme sites in a nucleotide sequence

    remap BAC.embl
    Comma separated enzyme list [all]:
    Minimum recognition site length [4]:
    Output file [bac.embl]

Sequence annotation

showfeat Show features of a sequence

    showfeat BAC.embl -pos -sort type

extractfeat Extract features from a sequence

maskfeat Write a sequence with masked features

coderet Extract CDS, mRNA and translations from feature tables

Exercise: extract the CDS from the BAC.embl file (EMBL format).

    extractfeat -sequence BAC.embl -type CDS -join

Pairwise: dottup, comparison between two sequences using dotplots

dottup Make a dotplot

    dottup dot1.fasta dot2.fasta

Pairwise sequence alignment: Global alignment

A global alignment is one that compares the two sequences over their entire lengths, and is appropriate for comparing sequences that are expected to share similarity over the whole length. The alignment maximizes regions of similarity and minimizes gaps using the scoring matrices and gap parameters provided to the program.

needle Needleman-Wunsch global alignment

stretcher Needleman-Wunsch rapid global alignment

esim4 Align an mRNA to a genomic DNA seq.

est2genome Align EST to a genomic DNA seq.
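To make the idea of a global alignment score concrete, here is a minimal Needleman-Wunsch scoring sketch in awk. It is not how needle is implemented: the scoring values (match +1, mismatch -1, gap -1) and the function name nw_score are assumptions chosen for readability, whereas needle uses a substitution matrix with separate gap opening and extension penalties.

```shell
# nw_score SEQ1 SEQ2: print the optimal global alignment score of two strings
# under a toy scoring scheme (match +1, mismatch -1, gap -1; all assumptions).
nw_score() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    match_s = 1; mismatch = -1; gap = -1
    n = length(a); m = length(b)
    # First row and column: aligning a prefix against nothing costs only gaps.
    for (i = 0; i <= n; i++) F[i, 0] = i * gap
    for (j = 0; j <= m; j++) F[0, j] = j * gap
    for (i = 1; i <= n; i++)
      for (j = 1; j <= m; j++) {
        s = (substr(a, i, 1) == substr(b, j, 1)) ? match_s : mismatch
        best = F[i-1, j-1] + s                               # align a[i] with b[j]
        if (F[i-1, j] + gap > best) best = F[i-1, j] + gap   # gap in b
        if (F[i, j-1] + gap > best) best = F[i, j-1] + gap   # gap in a
        F[i, j] = best
      }
    print F[n, m]   # both sequences consumed over their entire lengths
  }'
}

nw_score ACGT AGT
```

The score is read from the last cell of the table, which is what makes the alignment global: every base of both sequences has to be accounted for.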

Exercise: align AF454632.fasta and BAC2.fasta using needle, est2genome and dottup.

Pairwise sequence alignment: Local alignment

Local alignment searches for regions of local similarity and need not include the entire length of the sequences. Local alignment methods are very useful for scanning databases or in other circumstances when you wish to find matches between small regions of sequences, for example between protein domains.

water Smith-Waterman local alignment of seq.
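The local case can be sketched in the same spirit with a small Smith-Waterman scoring function in awk. As before this is only an illustration under assumed scores (match +1, mismatch -1, gap -1) and a made-up function name, sw_score; water itself works with substitution matrices and separate gap opening and extension penalties.

```shell
# sw_score SEQ1 SEQ2: print the best local alignment score of two strings
# under a toy scoring scheme (match +1, mismatch -1, gap -1; all assumptions).
sw_score() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    match_s = 1; mismatch = -1; gap = -1; best = 0
    n = length(a); m = length(b)
    # Cells never assigned (row 0 and column 0) read as 0 in awk arithmetic.
    for (i = 1; i <= n; i++)
      for (j = 1; j <= m; j++) {
        s = (substr(a, i, 1) == substr(b, j, 1)) ? match_s : mismatch
        h = H[i-1, j-1] + s
        if (H[i-1, j] + gap > h) h = H[i-1, j] + gap
        if (H[i, j-1] + gap > h) h = H[i, j-1] + gap
        if (h < 0) h = 0            # a local alignment may restart anywhere
        H[i, j] = h
        if (h > best) best = h      # and may end anywhere
      }
    print best
  }'
}

sw_score TTTACGTTT GGACGGG
```

The floor at zero and the maximum taken over the whole table are what let the alignment cover only a short shared region, here the common ACG.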

Exercise: align AF454632.fasta and BAC2.fasta using water.

Multiple Sequence Analysis

Multiple sequence alignments are used:
• To find patterns to characterize protein families.
• To detect or demonstrate homology between new sequences and existing families of sequences.
• To help predict the secondary and tertiary structures of the new sequences.
• As an essential prelude to molecular evolutionary analysis.

edialign Local multiple sequence alignment (DIALIGN)
emma Multiple sequence alignment (ClustalW wrapper)
infoalign Information about an alignment
showalign Display a multiple alignment
prettyplot Display aligned sequences for publication

Exercise:
• Do a multiple alignment using the multipleseq2.fasta file.
• Display the alignment information.
• Display the alignment as follows: only non-identical nucleotides, with a width of 50 characters, in HTML format.
• Finally, display the alignment with prettyplot.

plotcon Plot conservation of a sequence alignment

Exercise: plot the conservation of the previous alignment with different window sizes.

prophecy Creates matrices/profiles from multiple alignments

The output of prophecy can be used by the programs profit or prophet.

Multiple Sequence Analysis

getorf Finds and extracts open reading frames (ORFs)
plotorf Plots potential open reading frames

transeq Translate nucleic acid sequences

Exercise: find the number of ORFs of at least 300 nucleotides in the multipleseq.fasta file, and the number of unique nucleotide sequences they come from.


    getorf -sequence multipleseq.fasta -minsize 300
    seqcount outseq
    grep '>' outseq | sed 's/>//' | sed 's/_.*//' | sort | uniq | wc -l

Here outseq stands for the name of the getorf output file.

Protein & nucleotide analysis

garnier Predicts protein secondary structure
pepinfo Plots simple amino acid properties in parallel

Example:

    getorf AEK11795.fasta
    pepinfo AEK11795.orf

patmatmotifs Search a PROSITE motif database with a protein sequence
pscan Scans proteins using PRINTS

fuzznuc Use PROSITE style patterns to search nucleotide sequences

Pattern syntax:
• [ATGC] stands for A or T or G or C
• {A} stands for any base except A
• N(4) stands for NNNN
• N(2,4) stands for NN or NNN or NNNN
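One way to get a feel for this syntax is to translate a pattern such as [AT](4)TG{A}N(1,5)G into an extended regular expression and try it with grep. The sketch below is an approximation, not what fuzznuc does internally: it covers only the constructs listed above, treats N as [ACGT], and uses a made-up test sequence. It also relies on grep -o, which GNU and BSD grep provide.

```shell
pattern='[AT](4)TG{A}N(1,5)G'

# {X} -> [^X], (n) or (n,m) -> {n} or {n,m}, N -> [ACGT]
regex=$(printf '%s\n' "$pattern" |
  sed -e 's/{\([A-Z]*\)}/[^\1]/g' \
      -e 's/(\([0-9,]*\))/{\1}/g' \
      -e 's/N/[ACGT]/g')
echo "$regex"

# Search a toy sequence (invented for this example) and print the matched region.
hit=$(printf '%s\n' 'CCATATTGCAAGCC' | grep -oE "$regex")
echo "$hit"
```

Unlike this sketch, fuzznuc reports match positions and understands the full nucleotide ambiguity codes.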

Search for [AT](4)TG{A}N(1,5)G

Sequence analysis

cpgplot Plot the CpG rich areas
cusp Create a codon usage table from nucleotide sequence

Example:

    cusp AF454632.fasta
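As a rough picture of what a codon usage table is built from, the sketch below counts the codons of a toy coding sequence in reading frame 1 with awk. The sequence is invented, and the output is only raw counts; cusp reports additional columns such as fractions and frequencies.

```shell
seq="ATGGCCATGTAA"   # toy coding sequence (an assumption, not AF454632.fasta)

# Step through the sequence three bases at a time and tally each codon.
codon_counts=$(printf '%s\n' "$seq" |
  awk '{
    for (i = 1; i + 2 <= length($0); i += 3) count[substr($0, i, 3)]++
    for (codon in count) print codon, count[codon]
  }' | sort)
echo "$codon_counts"
```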

eprimer3 Picks PCR primers and hybridization oligos

einverted Find DNA inverted repeats
equicktandem Find tandem repeats

But also:
• Phylogeny
• HMM
• Nucleic and protein structure
• Ontology

EMBOSS on the Web: graphical applications (Jemboss)

http://bips.u-strasbg.fr/EMBOSS/
http://emboss.bioinformatics.nl/

EMBOSS is now available in Galaxy: https://main.g2.bx.psu.edu/

Exercise

Propose a script to:
• Extract the first 10,000 sequences from a nucleotide multi-FASTA file (454.fna).
• BLAST this file against a protein database with an E-value threshold of 1e-4.
• Count how many sequences show similarity to Gypsy.
• Extract the sequences with similarity to Gypsy.
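One possible starting point for the first step only is sketched below with awk on a toy file; the BLAST and Gypsy steps are left open. The file contents and the variable names (fasta, max, subset) are invented, and max is set to 2 here so the effect is visible, where the exercise would use 10000 on 454.fna.

```shell
# Toy multi-FASTA file (made-up reads, for illustration only).
fasta=$(mktemp)
cat > "$fasta" <<'EOF'
>read1
ACGT
>read2
GGCC
TTAA
>read3
TTTT
EOF

max=2   # number of records to keep

# Count header lines and stop reading as soon as record max+1 begins.
subset=$(awk -v max="$max" '/^>/ && ++n > max { exit } { print }' "$fasta")
echo "$subset"
rm -f "$fasta"
```

Exiting at the first unwanted header avoids scanning the rest of a large read file.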