EMBOSS: Handle and Analyze Sequences with Command Lines

EMBOSS
http://emboss.sourceforge.net/
EMBOSS is "The European Molecular Biology Open Software Suite", a free equivalent to GCG.
EMBOSS is a free, open-source software analysis package specially developed for the needs of the molecular biology (e.g. EMBnet) user community.
The software automatically copes with data in a variety of formats and even allows transparent retrieval of sequence data from the web.
EMBOSS also integrates a range of currently available packages and tools for sequence analysis into a seamless whole. EMBOSS breaks the historical trend towards commercial software packages.
Within EMBOSS you will find around 150 programs (applications).
A UNIX command-line interface:
• Type the name of the program
• Add any options you want to the command line
• Press RETURN!
• Any mandatory information that was not on the command line will be prompted for.
http://emboss.sourceforge.net/apps/groups.html

EMBOSS applications
Within EMBOSS you will find over 150 programs:
• Sequence alignments
• Rapid database searching with sequence patterns
• Protein motif identification, including domain analysis
• EST analysis
• Nucleotide sequence pattern analysis, for example to identify CpG islands
• Simple and species-specific repeat identification
• Codon usage analysis for small genomes
• Rapid identification of sequence patterns in large-scale sequence sets
• Presentation tools for publication
• And much more!
Output: text or image files

EMBOSS applications: wossname, a first EMBOSS application
All EMBOSS programs run from the Unix command line. We'll introduce the basics with a specific example: the EMBOSS utility wossname produces a list of all the EMBOSS applications.
wossname    Finding EMBOSS programs
wossname -help
wossname -help -verbose
Exercise: find the alignment programs in EMBOSS.
http://emboss.sourceforge.net/apps/release/6.4/emboss/apps/seqret.html

EMBOSS applications: Help
program -help -verbose

EMBOSS applications: Working with sequences
Change format: seqret (equivalent of readseq)
seqret    Reads and writes (returns) a sequence
Two ways to use EMBOSS programs in a terminal:
1) Let the program prompt you, for basic use: type seqret
2) Give the full command with options, for advanced use: type seqret -help (-verbose) to list them

EMBOSS applications: Let's start
Go to the EMBOSS-exercise directory:
cd <path to the directory>
cd EMBOSS-exercise
ls -l
http://emboss.sourceforge.net/docs/themes/SequenceFormats.html

EMBOSS applications: Change format (reformatting)
seqret (equivalent of readseq)
Let's start:
seqret -sequence dot1.fasta -outseq dot1.out -osformat2 embl
seqret -sequence 454.fna -outseq 454.embl -osformat2 embl
seqret can write your sequence in 39 output formats.

EMBOSS applications: Information about sequences
infoseq    Display basic information about sequences
infoseq dot1.fasta
infoseq -only -length -nohead -sequence dot1.fasta    (length only)
seqcount    Count the number of sequences (= grep -c '>')
Exercise:
• Calculate the number of sequences in the 454.fna file.
• Calculate the total number of bases in the 454.fna file.
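Both quantities can be cross-checked with standard Unix tools alone. The sketch below is a minimal illustration on an invented three-sequence file (toy.fasta and its contents are assumptions, not course data). It relies only on each FASTA record having exactly one header line that starts with '>'.

```shell
#!/bin/sh
# Build a three-sequence toy FASTA file (invented data).
cat > toy.fasta <<'EOF'
>seq1
ACGTACGT
ACGT
>seq2
GGCC
>seq3
TTTTTT
EOF

# count_seqs FILE: number of sequences, one header line per record.
count_seqs() {
    grep -c '^>' "$1"
}

# sum_bases FILE: total number of bases, adding the length of every
# non-header line.
sum_bases() {
    awk '!/^>/ { total += length($0) } END { print total + 0 }' "$1"
}

count_seqs toy.fasta   # prints 3
sum_bases toy.fasta    # prints 22
```

On the real data the same functions would be called as count_seqs 454.fna and sum_bases 454.fna; seqcount and infoseq -only -length should report matching figures.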
Exercise:
• Calculate the number of sequences in the multipleseq.fasta file.
• Calculate the total number of bases in the multipleseq.fasta file.
grep -c '>' multipleseq.fasta
infoseq -nohead -only -length multipleseq.fasta | awk '{sum += $1} END {print sum}'

EMBOSS applications: Edit sequences
sizeseq    Sort sequences by size
seqretsplit    Read sequences and write them to individual files
Exercise:
• Find the longest and the shortest sequence in 454.fna.
• Write the sequences from multipleseq2.fasta into individual files. Don't do this with the 454.fna file!
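As a cross-check for the first part of this exercise, sequence lengths can also be tabulated with awk and sorted numerically. This is a plain-shell sketch, not the EMBOSS route; the file toy_sizes.fasta and the helper name seq_lengths are invented for illustration.

```shell
#!/bin/sh
# seq_lengths FILE: print "<length><TAB><id>" for every record of a FASTA file.
seq_lengths() {
    awk '/^>/ { if (id != "") print len "\t" id; id = substr($1, 2); len = 0; next }
         { len += length($0) }
         END { if (id != "") print len "\t" id }' "$1"
}

# Toy input with three records of different sizes (invented data).
cat > toy_sizes.fasta <<'EOF'
>short
ACGT
>long
ACGTACGTACGT
ACGT
>medium
ACGTACGT
EOF

seq_lengths toy_sizes.fasta | sort -n | head -n 1   # shortest record: length 4, id "short"
seq_lengths toy_sizes.fasta | sort -n | tail -n 1   # longest record: length 16, id "long"
```

Sorting a length table this way mirrors what sorting the sequences by size and then looking at the first and last entries achieves.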
sizeseq 454.fna
infoseq -nohead 454.fna.sizeseq | head -n1
infoseq -nohead 454.fna.sizeseq | tail -n1

EMBOSS applications: Edit sequences
cutseq    Remove a section from a sequence
extractseq    Extract regions from a sequence
maskseq    Write a sequence with masked regions
revseq    Reverse and complement a nucleotide sequence
makenucseq, makeprotseq    Create random sequences
splitter    Split a sequence into smaller sequences
Exercise: extract a part of the BAC.embl sequence, in reverse orientation and with the nucleotides in lowercase.
Exercise: from the BAC.embl file, use splitter to get sequences of 92 bp with an overlap of 91 bp.
extractseq -sequence BAC.embl -regions '1-1000' -outseq BAC.part -sreverse1 -slower1
splitter -sequence BAC.embl -size 92 -overlap 91

EMBOSS applications: Edit sequences
extractalign    Extract regions from an alignment
skipredundant    Remove redundant sequences
trimest    Remove poly-A tails
trimseq    Remove unwanted characters
entret    Retrieve a sequence entry
seqret    Retrieve a sequence entry
Exercise: from the multipleseq.fasta file, retrieve the AFP0AAA1YB10RM1 sequence and write it to a file in EMBL format.
seqret -sequence multipleseq.fasta:AFP0AAA1YB10RM1 -osformat2 embl

EMBOSS applications: Edit sequences
remap    Display restriction enzyme sites in a nucleotide sequence
remap BAC.embl
Comma separated enzyme list [all]:
Minimum recognition site length [4]:
Output file [bac.embl]

EMBOSS applications: Sequence annotation
showfeat    Show features of a sequence
showfeat BAC.embl -pos -sort type
extractfeat    Extract features from a sequence
maskfeat    Write a sequence with masked features
coderet    Extract CDS, mRNA and translations from features
Exercise: extract the CDS features from the BAC.embl file, in EMBL format.
extractfeat -sequence BAC.embl -type CDS -join

EMBOSS applications: Pairwise sequence alignment
dottup: comparison between two sequences using dotplots
dottup    Make a dotplot
dottup dot1.fasta dot2.fasta

EMBOSS applications: Pairwise sequence alignment, global alignment
A global alignment compares the two sequences over their entire lengths and is appropriate for sequences that are expected to share similarity over the whole length. The alignment maximizes regions of similarity and minimizes gaps using the scoring matrices and gap parameters provided to the program.
needle    Needleman-Wunsch global alignment
stretcher    Needleman-Wunsch rapid global alignment
esim4    Align an mRNA to a genomic DNA sequence
est2genome    Align an EST to a genomic DNA sequence
Exercise: align AF454632.fasta and BAC2.fasta using needle, est2genome and dottup.

EMBOSS applications: Pairwise sequence alignment, local alignment
A local alignment searches for regions of local similarity and need not include the entire length of the sequences. Local alignment methods are very useful for scanning databases, or whenever you wish to find matches between small regions of sequences, for example between protein domains.
water    Smith-Waterman local alignment of sequences
Exercise: align AF454632.fasta and BAC2.fasta using water.

EMBOSS applications: Multiple sequence analysis
Multiple sequence alignments are used:
• To find patterns that characterize protein families.
• To detect or demonstrate homology between a new sequence and existing families of sequences.
• To help predict the secondary and tertiary structures of new sequences.
• As an essential prelude to molecular evolutionary analysis.
edialign    Local multiple sequence alignment (DIALIGN)
emma    Multiple sequence alignment (ClustalW wrapper)
infoalign    Information about an alignment
showalign    Display a multiple alignment
prettyplot    Display aligned sequences for publication
Exercise:
• Do a multiple alignment using the multipleseq2.fasta file.
• Display the alignment information.
• Display the alignment as follows: only non-identical nucleotides, with a width of 50 characters, in HTML format.
• Finally, display the alignment with prettyplot.

EMBOSS applications: Multiple sequence analysis
plotcon    Plot conservation of a sequence alignment
Exercise: plot the conservation of the previous alignment with different window sizes.
prophecy    Create matrices/profiles from multiple alignments
The output of prophecy can be used by the programs profit and prophet.

EMBOSS applications: Multiple sequence analysis
getorf    Find and extract open reading frames (ORFs)
plotorf    Plot potential open reading frames
transeq    Translate nucleic acid sequences
Exercise: from the multipleseq.fasta file, find the number of ORFs of at least 300 nucleotides and the number of unique nucleotide sequences they come from.
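The second count in this exercise depends on how getorf names its output: each ORF ID is the source sequence ID followed by an underscore and a running number. Assuming that naming scheme, the sketch below strips only the trailing counter, so that source IDs which themselves contain an underscore are not truncated. The header lines are mock data and parent_ids is a hypothetical helper name.

```shell
#!/bin/sh
# Mock getorf-style headers (invented IDs and coordinates).
cat > orf_headers.txt <<'EOF'
>readA_1 [1 - 330]
>readA_2 [400 - 750]
>read_B_1 [10 - 400]
>readC_1 [5 - 320]
EOF

# parent_ids FILE: take the ID from each header, drop the leading ">" and the
# trailing "_<n>" ORF counter, then list every parent ID once.
parent_ids() {
    grep '^>' "$1" | awk '{ print substr($1, 2) }' | sed 's/_[0-9][0-9]*$//' | sort -u
}

parent_ids orf_headers.txt | wc -l   # 3 distinct parent sequences
```

Removing everything after the first underscore instead would merge read_B into a shorter, wrong ID, which is why the counter is anchored to the end of the ID here.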
getorf -sequence multipleseq.fasta -minsize 300 -outseq outseq
seqcount outseq
grep '>' outseq | sed 's/>//' | sed 's/_.*//' | sort | uniq | wc -l

EMBOSS applications: Protein & nucleotide analysis
garnier    Predict protein secondary structure
pepinfo    Plot simple amino acid properties in parallel
Example:
getorf AEK11795.fasta
pepinfo AEK11795.orf

EMBOSS applications: Protein & nucleotide analysis
patmatmotifs    Search a PROSITE motif database with a protein sequence
pscan    Scan proteins using PRINTS
fuzznuc    Use PROSITE-style patterns to search nucleotide sequences
Pattern syntax:
[ATGC]    A or T or G or C
{A}    any base except A
N(4)    NNNN
N(2,4)    NN or NNN or NNNN
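The pattern elements listed above map closely onto extended regular expressions, which can help when reading a pattern such as [AT](4)TG{A}N(1,5)G. The sketch below is only an approximation for intuition: it assumes a sequence made of A, C, G and T written on a single line (invented data) and uses grep -o, which GNU and BSD grep provide. It is not a replacement for fuzznuc, which also handles ambiguity codes and mismatches.

```shell
#!/bin/sh
# fuzznuc-style pattern:  [AT](4)  TG  {A}    N(1,5)       G
# ERE counterpart:        [AT]{4}  TG  [CGT]  [ACGT]{1,5}  G
pattern='[AT]{4}TG[CGT][ACGT]{1,5}G'

# Toy sequence on a single line (invented data).
seq='CCATTATGCAAGCC'

# -E selects extended regular expressions; -o prints only the matched text.
printf '%s\n' "$seq" | grep -E -o "$pattern"   # prints ATTATGCAAG
```

With EMBOSS itself the pattern is passed unchanged, for example fuzznuc -sequence BAC.embl -pattern '[AT](4)TG{A}N(1,5)G' (the choice of BAC.embl as input is an assumption here).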
Search for [AT](4)TG{A}N(1,5)G

EMBOSS applications: Sequence analysis
cpgplot    Plot the CpG-rich areas
cusp    Create a codon usage table from nucleotide sequences
Example: cusp AF454632.fasta
eprimer3    Pick PCR primers and hybridization oligos
einverted    Find DNA inverted repeats
equicktandem    Find tandem repeats

EMBOSS applications: But also
• Phylogeny
• HMM
• Nucleic acid and protein structure
• Ontology

EMBOSS on the Web: graphical applications (Jemboss)
http://bips.u-strasbg.fr/EMBOSS/
http://emboss.bioinformatics.nl/
EMBOSS is now available in Galaxy: https://main.g2.bx.psu.edu/

Exercise
Propose a script to:
• Extract the first 10,000 sequences from a nucleotide multi-FASTA file (454.fna).
• BLAST this file against a protein database with a maximum E-value of 1e-4.
• Count how many sequences show similarity to Gypsy.
• Extract the sequences with similarity to Gypsy.
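One possible shape for such a script is sketched below; it is not the course's reference solution. It assumes the NCBI BLAST+ blastx program is installed, that the protein database has already been formatted, and that Gypsy-like hits can be recognised by the word "gypsy" in the subject title. The helper names and the intermediate file names (first10k.fna, hits.tsv, gypsy_ids.txt, gypsy.fna) are invented for illustration.

```shell
#!/bin/sh
# first_n_fasta N FILE: print the first N records of a multi-FASTA file.
first_n_fasta() {
    awk -v n="$1" '/^>/ { count++ } count > n { exit } { print }' "$2"
}

# gypsy_ids HITS: unique query IDs (column 1) of tab-separated hit lines
# that mention "gypsy", case-insensitively.
gypsy_ids() {
    grep -i 'gypsy' "$1" | cut -f1 | sort -u
}

# extract_by_id IDS FASTA: print the FASTA records whose ID is listed in IDS.
extract_by_id() {
    awk 'FILENAME == ARGV[1] { want[$1] = 1; next }
         /^>/ { keep = (substr($1, 2) in want) }
         keep' "$1" "$2"
}

# run_pipeline FASTA DB: the four steps of the exercise, in order.
# DB is a placeholder for a protein BLAST database of your own.
run_pipeline() {
    first_n_fasta 10000 "$1" > first10k.fna
    blastx -query first10k.fna -db "$2" -evalue 1e-4 \
           -outfmt '6 qseqid sseqid evalue stitle' -out hits.tsv
    gypsy_ids hits.tsv > gypsy_ids.txt
    wc -l < gypsy_ids.txt                      # number of Gypsy-like sequences
    extract_by_id gypsy_ids.txt first10k.fna > gypsy.fna
}
```

Nothing runs until run_pipeline is called with the FASTA file and a database name. Splitting the steps into small functions makes each one easy to check on a toy file before launching the BLAST search.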