EMBOSS: Handle and Analyze Sequences with Command Lines
http://emboss.sourceforge.net

EMBOSS is "The European Molecular Biology Open Software Suite"

FREE EQUIVALENT TO GCG

EMBOSS is a free Open Source software analysis package specially developed for the needs of the molecular biology (e.g. EMBnet) user community. The software automatically copes with data in a variety of formats and even allows transparent retrieval of sequence data from the web. EMBOSS also integrates a range of currently available packages and tools for sequence analysis into a seamless whole. EMBOSS breaks the historical trend towards commercial software packages.

Within EMBOSS you will find around 150 programs (applications).

A UNIX command-line interface:
• Type the name of the program
• Add any options you want to the command line
• Press RETURN!
• Any mandatory information that was not on the command line will be prompted for.

http://emboss.sourceforge.net/apps/groups.html

EMBOSS applications

Within EMBOSS you will find over 150 programs:
• Sequence alignment
• Rapid database searching with sequence patterns
• Protein motif identification, including domain analysis
• EST analysis
• Nucleotide sequence pattern analysis, for example to identify CpG islands
• Simple and species-specific repeat identification
• Codon usage analysis for small genomes
• Rapid identification of sequence patterns in large-scale sequence sets
• Presentation tools for publication
• And much more!

OUTPUT: text or image files

EMBOSS applications: wossname, a first EMBOSS application

All EMBOSS programs run from the Unix command line. We'll introduce the basics with a specific example: the EMBOSS utility wossname will produce a list of all the various EMBOSS applications.
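Before typing a program name, it can help to check that EMBOSS is actually on your PATH. The sketch below is only an illustration: have_program is a hypothetical helper name, and it assumes that the EMBOSS global qualifier -auto (turn off prompts) makes wossname list every application without asking for a keyword.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the named program is found on PATH.
have_program() {
    command -v "$1" >/dev/null 2>&1
}

if have_program wossname; then
    # -auto suppresses the interactive prompts (assumed to list all programs).
    wossname -auto
else
    echo "wossname not found: is EMBOSS installed and on your PATH?"
fi
```

If the message is printed instead of a program list, install EMBOSS or adjust PATH before going further.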
    wossname

Finding EMBOSS programs

    wossname -help
    wossname -help -verbose

Exercise: find the alignment programs in EMBOSS.

http://emboss.sourceforge.net/apps/release/6.4/emboss/apps/seqret.html

EMBOSS applications: Help

    program -help -verbose

EMBOSS applications: Working with sequences

Change format: seqret (equivalent of readseq)

    seqret    Reads and writes (returns) a sequence

Two ways to use EMBOSS programs in the Terminal:
1) Use the prompt for basic use: type seqret
2) Use the full command for advanced use with options: type seqret -h (-verbose)

Let's start

Go to the EMBOSS-exercise directory:

    cd <path to the directory>
    cd EMBOSS-exercise
    ls -l

http://emboss.sourceforge.net/docs/themes/SequenceFormats.html

EMBOSS applications: Change format (reformatting)

seqret (equivalent of readseq)

    seqret -sequence dot1.fasta -outseq dot1.out -osformat2 embl
    seqret -sequence 454.fna -outseq 454.embl -osformat2 embl

seqret can write your sequence in any of 39 output formats.

EMBOSS applications: Information about sequences

    infoseq     Display basic information about sequences
        infoseq dot1.fasta
        infoseq -only -length -nohead -sequence dot1.fasta    (length only)
    seqcount    Count the number of sequences (= grep -c '>')

Exercise:
• Calculate the number of sequences in the 454.fna file.
• Calculate the sum of bases of the 454.fna file.
• Calculate the number of sequences in the multipleseq.fasta file.
• Calculate the sum of bases of the multipleseq.fasta file.

    grep -c '>' multipleseq.fasta
    infoseq -nohead -only -length multipleseq.fasta | awk '{sum += $1} END {print sum}'

EMBOSS applications: Edit sequences

    sizeseq        Sort sequences by size
    seqretsplit    Reads sequences
                   and writes them to individual files

Exercise:
• Find the longest and the shortest sequence in 454.fna.
• Write the sequences from multipleseq2.fasta into individual files. Don't do it with the 454.fna file!

    sizeseq 454.fna
    infoseq -nohead 454.fna.sizeseq | head -n1
    infoseq -nohead 454.fna.sizeseq | tail -n1

EMBOSS applications: Edit sequences

    cutseq         Removes a section from a sequence
    extractseq     Extract regions from a sequence
    maskseq        Write a sequence with masked regions
    revseq         Reverse and complement a nucleotide sequence
    makenucseq, makeprotseq
                   Create random sequences
    splitter       Split a sequence into smaller sequences

Exercise: extract a part of the BAC.embl sequence, in reverse, and put the nucleotides in lowercase.

Exercise: from the BAC.embl file, use splitter to get sequences of 92 bp with an overlap of 91 bp.
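To see what the first exercise asks for, here is a minimal sketch using only standard Unix tools on a toy sequence. It assumes that "in reverse" means reverse-complement, as is usual for nucleotide sequences; the sequence AACGT is made up for illustration and is not taken from BAC.embl.

```shell
#!/bin/sh
# Toy nucleotide sequence (hypothetical, not from BAC.embl).
seq="AACGT"

# 1) reverse the string, 2) complement each base, 3) convert to lowercase.
result=$(printf '%s\n' "$seq" | rev | tr 'ACGT' 'TGCA' | tr 'A-Z' 'a-z')

echo "$result"   # prints acgtt
```

This only handles a bare sequence string; EMBOSS programs also take care of file formats, identifiers and ambiguity codes.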
    extractseq -sequence BAC.embl -regions '1-1000' -outseq BAC.part -sreverse1 -slower1
    splitter -sequence BAC.embl -size 92 -overlap 91

EMBOSS applications: Edit sequences

    extractalign     Extract regions from an alignment
    skipredundant    Remove redundant sequences
    trimest          Remove poly-A tails
    trimseq          Remove unwanted characters
    entret           Retrieve a sequence entry
    seqret           Retrieve a sequence entry

Exercise: from the multipleseq.fasta file, retrieve the AFP0AAA1YB10RM1 sequence and write it to a file in EMBL format.

    seqret -sequence multipleseq.fasta:AFP0AAA1YB10RM1 -osformat2 embl

EMBOSS applications: Edit sequences

    remap    Display restriction enzyme sites in a nucleotide sequence

    remap BAC.embl
    Comma separated enzyme list [all]:
    Minimum recognition site length [4]:
    Output file [bac.embl]

EMBOSS applications: Sequence annotation

    showfeat       Show features of a sequence
        showfeat BAC.embl -pos -sort type
    extractfeat    Extract features from a sequence
    maskfeat       Write a sequence with masked features
    coderet        Extract CDS, mRNA and translations from feature tables

Exercise: extract the CDS from the BAC.embl file (EMBL format).

    extractfeat -sequence BAC.embl -type CDS -join

EMBOSS applications: Pairwise sequence alignment: dottup, comparison between two sequences using dotplots

    dottup    Make a dotplot
        dottup dot1.fasta dot2.fasta

EMBOSS applications: Pairwise sequence alignment: Global alignment

A global alignment is one that compares the two sequences over their entire lengths, and is appropriate for comparing sequences that are expected to share similarity over the whole length.
The alignment maximizes regions of similarity and minimizes gaps using the scoring matrices and gap parameters provided to the program.

    needle        Needleman-Wunsch global alignment
    stretcher     Needleman-Wunsch global alignment
    esim4         Align an mRNA to a genomic DNA sequence
    est2genome    Align an EST to a genomic DNA sequence

Exercise: align AF454632.fasta and BAC2.fasta using needle, est2genome and dottup.

EMBOSS applications: Pairwise sequence alignment: Local alignment

Local alignment searches for regions of local similarity and need not include the entire length of the sequences. Local alignment methods are very useful for scanning databases or in other circumstances when you wish to find matches between small regions of sequences, for example between protein domains.

    water    Smith-Waterman local alignment of sequences

Exercise: align AF454632.fasta and BAC2.fasta using water.

EMBOSS applications: Multiple sequence analysis

Multiple sequence alignments are used:
• To find patterns to characterize protein families.
• To detect or demonstrate homology between new sequences and existing families of sequences.
• To help predict the secondary and tertiary structures of the new sequences.
• As an essential prelude to molecular evolutionary analysis.
    edialign      Local multiple sequence alignment (DIALIGN)
    emma          Multiple sequence alignment (ClustalW wrapper)
    infoalign     Information about an alignment
    showalign     Display a multiple alignment
    prettyplot    Display aligned sequences for publication

Exercise: make a multiple alignment of the multipleseq2.fasta file, display the alignment information, then display the alignment as follows: only non-identical nucleotides, with a width of 50 characters, in HTML format. Finally, display the alignment with prettyplot.

EMBOSS applications: Multiple sequence analysis

    plotcon    Plot conservation of a sequence alignment

Exercise: plot the conservation of the previous alignment with different window sizes.

    prophecy    Creates matrices/profiles from multiple alignments

The output of prophecy can be used by the programs profit or prophet.

EMBOSS applications: Multiple sequence analysis

    getorf     Finds and extracts open reading frames (ORFs)
    plotorf    Plots potential open reading frames
    transeq    Translate nucleic acid sequences

Exercise: find the number of ORFs of at least 300 nucleotides in the multipleseq.fasta file, and the number of unique nucleotide sequences they come from.

    getorf -sequence multipleseq.fasta -minsize 300
    seqcount outseq
    grep '>' outseq | sed 's/>//' | sed 's/_.*//' | sort | uniq | wc
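The counting pipeline above can be tried without running getorf. The sketch below writes a toy file that imitates getorf headers; the IDs, coordinates and the ID_number header layout are assumptions for illustration. It then counts the ORFs and the distinct source sequences.

```shell
#!/bin/sh
# Toy stand-in for a getorf output file (hypothetical IDs and coordinates).
orf_file=$(mktemp)
cat > "$orf_file" <<'EOF'
>seqA_1 [1 - 330]
ATGAAA
>seqA_2 [400 - 750]
ATGCCC
>seqB_1 [10 - 400]
ATGGGG
EOF

# Number of ORFs: one header line per ORF.
orf_count=$(grep -c '^>' "$orf_file")

# Number of distinct source sequences: strip '>' and everything from the
# first underscore, then count the unique IDs that remain.
source_count=$(grep '^>' "$orf_file" | sed 's/^>//' | sed 's/_.*//' | sort -u | wc -l | tr -d ' ')

echo "ORFs: $orf_count"
echo "Source sequences: $source_count"
rm -f "$orf_file"
```

Because the second sed cuts at the first underscore, sequence IDs that themselves contain an underscore would be truncated and could be merged by mistake. In that case a stricter pattern that only removes the trailing ORF number is safer.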
