Background Knowledge Rules of Evolution

Chapter 9: Building a multiple sequence alignment

"The man with two watches never knows the time, and the man with one watch only thinks he knows."
– A man with multiple watches

Background knowledge
• Multiple alignment is a Swiss Army knife of bioinformatics:*
  – predicting protein structure
  – predicting protein function
  – phylogenetic analysis.
• It is both art and science.
• It is easy to generate a bad alignment that looks good.
*Perl is the Swiss Army knife for bioinformatics software programmers.

[Figure: multiple sequence alignment example]

Rules of evolution
• Important aa are NOT allowed to mutate.
• Less important residues change easily, sometimes randomly, sometimes to gain a new function.
• Conserved positions in an alignment are more important than non-conserved positions.

Criteria for multiple sequence alignment
• Similarity
  – structural: aa with the same role in the structure are in the same column (aligned),
  – evolutionary: aa related to the same predecessor are aligned,
  – functional: aa with the same function are aligned,
  – sequence: aa that yield the best alignment with the least degeneracy are aligned.

Applications of multiple sequence alignment
• Extrapolation
• Phylogenetic analysis
• Functional pattern identification
• Domain identification
• Identification of regulatory elements
• Structure prediction for proteins and RNA
• PCR analysis http://blocks.fhcrc.org/codehop.html

Outline
• Gathering the sequences
• ClustalW
• TCoffee
• Comparing unalignable

Gathering the sequences
• Proteins are better than DNA (shorter, more informative; exceptions: primer design, non-coding targets).
• A protein family is usually too large to collect all sequences. Start with 10, remove troublemakers, and rerun with 50 sequences.
• Discard sequences with <30% and >90% identity.
• Read the DE (description) section of each sequence entry to check for transposons.
• Discard sequences shorter than the others.
• Extract (with Dotlet) sequences with repeated domains.
• Note: There are mistakes in databases.
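The identity rule in the list above (discard sequences with <30% and >90% identity) can be sketched in Python. This is only an illustration under stated assumptions, not something from the slides: each candidate is assumed to be already pairwise-aligned to the query (two rows of equal length, "-" for gaps), identity is counted over columns where both rows hold a residue (other definitions exist), and the names `percent_identity` and `filter_by_identity` are hypothetical.

```python
def percent_identity(row_a, row_b):
    """Percent identity over columns where both aligned rows hold a residue."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    columns = [(a, b) for a, b in zip(row_a, row_b) if a != "-" and b != "-"]
    if not columns:
        return 0.0
    matches = sum(1 for a, b in columns if a == b)
    return 100.0 * matches / len(columns)


def filter_by_identity(pairwise, low=30.0, high=90.0):
    """Return names of hits whose identity to the query lies in [low, high].

    `pairwise` maps a hit name to its (query_row, hit_row) pairwise alignment.
    The 30/90 defaults come from the slide; the data layout is an assumption.
    """
    kept = []
    for name, (query_row, hit_row) in pairwise.items():
        if low <= percent_identity(query_row, hit_row) <= high:
            kept.append(name)
    return kept


# Toy alignments, invented for illustration only.
pairwise = {
    "near_duplicate": ("ACDEFGHIKL", "ACDEFGHIKL"),
    "informative": ("ACDEFGHIKL", "ACDEFAAAAA"),
    "too_distant": ("ACDEFGHIKL", "AWWWWWWWWW"),
}
print(filter_by_identity(pairwise))
```

Near-duplicates add no new information and very distant hits are hard to align reliably, which is why both ends of the range are dropped.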
Mistakes in databases
• Sequencing errors
  – revised Cambridge sequence of human mtDNA (Anderson 1981) http://www.mitomap.org/mitomap/mitoseq.html
• Taxonomic misidentification
  – Amanita pantherina from Japan can also be A. ibotengutake (in mixed forests with Fagaceae and Pinaceae).
  – Russula drimeia var. queletii = Russula flavovirens = Russula queletii var. flavovirens = Russula sardonia sensu auct. mult.

Phylogenetic analysis on DNAs
• Better on proteins.
• Translate DNA to protein.
• Perform multiple alignment.
• Thread DNA back onto the protein alignment framework.

Restrict number of sequences
• Computing and building an accurate big alignment is difficult.
• Displaying is difficult – aim for printing and monitor size.
• Using is difficult – phylogenetic programs can NOT handle them.
• Included bad sequences multiply mistakes – avoid long indels and mavericks.
• Compromise between similarity and new information (30-70(90)% identity).
• Uncharacterized sequences are useful only if you want to see conserved, unmutable regions.

Naming your sequences
• Never_use_white_spaces:
  – Amanita phalloides.
• Never use ßšpeciál symbols.
• Shorten your name to 15 characters:
  – Amanita_phalloides
• Use different names for different sequences:
  – A.ph.261004_a

Step 1
• BLASTing at the ExPASy server – avoid translated EST.
  http://www.expasy.ch/cgi-bin/BLASTEMBnet-CH.pl

Sequence selection
• NOT just the best ten.
• Similarity along the whole sequence – NOT just the bits.

Step 2
• Import results from BLAST to alignment programs.

Step 3
• Run multiple alignment using different methods, compare results.

Program selection
• FASTA and Swiss-Prot formats are good for saving.
• ClustalW handles more sequences.
• TCoffee is more accurate than ClustalW.

ClustalW
• http://www.ebi.ac.uk/clustalw/index.html
• Paula Hogeweg described the algorithm, Des Higgins wrote the program Clustal.
• Progressive method – adding sequences one by one.
• Input gaps are NOT removed – use FASTA (SeqCheck) to clean the sequence.
• Order of input sometimes influences the alignment.

ClustalW principle
• Compares sequences two by two:
  – compares A and B, makes consensus AB
  – compares C and D, makes consensus CD
  – compares AB and CD.
• Does NOT use all the information at once.
• The W in ClustalW stands for weights: every sequence receives a weight proportional to the amount of new information it contributes.

ClustalW output
• Use default, output in input sequence order.
• Reformat using fmtseq if needed http://bioweb.pasteur.fr/seqanal/interfaces/fmtseq.html.
• Results:
  – pairwise scores (ignore)
  – multiple alignment
  – guide tree – "dendrogram" used by Clustal, in Newick format (text and parentheses), NOT a proper phylogenetic tree.

ClustalW parameters
• Change only if dissatisfied with blocks and weakly conserved positions (if the alignment is difficult to interpret):
  – substitution matrix – change BLOSUM to PAM.
  – gap-opening penalty (GOP) / gap-extension penalty (GEP) – try empirically, automatically readjusted.

ClustalW mirrors
• Europe http://www.ebi.ac.uk/clustalw/index.html
• USA http://pir.georgetown.edu/pirwww/search/multaln.html
• Japan http://clustalw.genome.ad.jp/

TCoffee
• http://igs-server.cnrs-mrs.fr/Tcoffee/tcoffee_cgi/index.cgi
• Slower, more precise.
• Uses progressive alignment like Clustal, but it compares segments across the entire sequence set.
• First, ClustalW (and Lalign) make a collection of local and global pairwise alignments – the "library".
• Second, the sequences are progressively aligned to yield the highest possible agreement with all pairwise alignments in the library.
• New option – structural alignment.
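The "library" idea can be made concrete with a small Python sketch. It is a simplified illustration, not TCoffee's actual code: it assumes each pairwise alignment is given as two gapped rows of equal length, weights every aligned residue pair by the percent identity of the alignment it came from, and sums the weights when two methods propose the same pair. That weighting follows the published TCoffee description as I understand it and should be treated as an assumption; the later library extension and progressive alignment steps are omitted, and all function names are hypothetical.

```python
from collections import defaultdict


def aligned_pairs(name_a, row_a, name_b, row_b):
    """Yield ((name_a, i), (name_b, j)) for residues that share a column."""
    i = j = 0  # ungapped residue positions in each sequence
    for char_a, char_b in zip(row_a, row_b):
        if char_a != "-" and char_b != "-":
            yield (name_a, i), (name_b, j)
        if char_a != "-":
            i += 1
        if char_b != "-":
            j += 1


def identity_weight(row_a, row_b):
    """Percent identity over columns where both rows hold a residue."""
    columns = [(a, b) for a, b in zip(row_a, row_b) if a != "-" and b != "-"]
    if not columns:
        return 0.0
    return 100.0 * sum(1 for a, b in columns if a == b) / len(columns)


def build_library(pairwise_alignments):
    """Merge pairwise alignments into one weighted residue-pair library."""
    library = defaultdict(float)
    for name_a, row_a, name_b, row_b in pairwise_alignments:
        weight = identity_weight(row_a, row_b)
        for pair in aligned_pairs(name_a, row_a, name_b, row_b):
            library[pair] += weight  # agreement between methods adds up
    return dict(library)


# Two competing alignments of the same toy pair, invented for illustration.
alignments = [
    ("A", "ACGT", "B", "AC-T"),
    ("A", "ACGT", "B", "A-CT"),
]
library = build_library(alignments)
for pair, weight in sorted(library.items()):
    print(pair, round(weight, 1))
```

Residue pairs supported by both alignments end up with the highest weights, which is the "agreement" the progressive step then tries to maximise.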
[Figure: TCoffee choices]

3-D Coffee
• Appropriate method used for each pair.

TCoffee output
• aln – text file alignment, can be used as input for other programs;
• html – colour coded, red indicates high quality, blue low;
• pdf – html converted;
• dnd – dendrogram used by TCoffee, in default Newick format;
• ph – true phylogenetic tree in Newick format (generated by the neighbour-joining method);
• png – picture of ph.

Evaluating quality of alignment with TCoffee
• Align by another program.
• Evaluate with TCoffee.
• There are NO E-values for multiple alignments.
• Yellow, orange, and red mean an 80% chance of being correctly aligned.

Looking at protein alignment
* entirely conserved column
: roughly the same size and hydropathy of aa
. the size or hydropathy preserved in the course of evolution
A good block of 20 aa has 3 stars (*), 6 colons (:), and few periods (.).

…looking at protein alignment
• Conserved columns in a multiple alignment are meaningful only when the surroundings are NOT conserved.

Patterns of conservation
• Tryptophan – large hydrophobic aa, deep in the core, difficult to mutate. If it mutates, then to another aromatic aa (phenylalanine or tyrosine). WYF
• Glycine or proline are on the extremities of beta-strands or alpha helices. GP
• Cysteines form disulphide bridges; the distance between them is a signature of a domain. CC
• Histidine and serine occur in catalytic sites (of proteases). HS

…patterns of conservation
• Charged aa lysine, arginine, aspartic acid, and glutamic acid are involved in ligand binding or in a salt bridge inside the core. KRDE
• Leucine is conserved only in protein-protein interactions like the leucine zipper. L

Refinement of alignment
• Adding distantly related sequences should enhance existing patterns rather than destroy blocks.
• Regions with indels – candidates for loops (flexible part of the structure, acting as a connector).
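The conservation patterns above (WYF, GP, CC, HS, KRDE, L) can be turned into a rough column annotator. The Python sketch below is an illustration only: the rule that a column matches when all of its residues fall inside one of the slide's groups is an assumption of this sketch, the hint texts paraphrase the slides, and `annotate_columns` is a hypothetical name.

```python
# Residue groups and hints taken from the "Patterns of conservation" slides.
PATTERN_HINTS = {
    "WYF": "aromatic, probably buried in the core",
    "GP": "probably at the extremity of a beta-strand or alpha helix",
    "C": "possible disulphide bridge partner",
    "HS": "possible catalytic site (e.g. in proteases)",
    "KRDE": "charged: ligand binding or salt bridge",
    "L": "possible protein-protein interaction (e.g. leucine zipper)",
}


def annotate_columns(rows):
    """Return (column index, group, hint) for columns inside one group.

    `rows` are aligned sequences of equal length. Columns containing a gap
    or any residue outside the groups are skipped.
    """
    if not rows or len({len(row) for row in rows}) != 1:
        raise ValueError("need at least one row, all of the same length")
    hits = []
    for index, column in enumerate(zip(*rows)):
        residues = set(column)
        for group, hint in PATTERN_HINTS.items():
            if residues <= set(group):
                hits.append((index, group, hint))
                break
    return hits


# Toy alignment, invented for illustration only.
rows = ["MWCK", "MFCR", "MYCD"]
for index, group, hint in annotate_columns(rows):
    print(index, group, hint)
```

A hit is only a hint: as the slides note, a conserved column is meaningful only when its surroundings are not conserved, which this sketch does not check.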
[Figure: alignment before adding a new sequence, and after (aln format)]

[Figure: DCA]

Other multiple alignment methods
• Dialign – builds alignments by comparing whole segments of the sequences, NOT residues. No gap penalty is used. For globally unrelated sequences with local similarities (genomic DNA and protein families). http://dialign.gobics.de/chaos-dialign-submission
• DCA (Divide-and-Conquer Alignment) – heuristic approach to sum-of-pairs (SP) optimal alignment. http://bibiserv.techfak.uni-bielefeld.de/dca/submission.html

When NOT to use multiple sequence alignment: assembly
• Assembling sequence pieces in a genome sequencing project (use Phrap instead). http://www.phrap.org/

When NOT to use multiple sequence alignment: comparing unalignable
• Sequences without a common ancestor, a sequence without a homologue.
• Look simultaneously at all sequences for short conserved gap-free segments (Gibbs sampler).
• Look simultaneously at all sequences for flexible patterns – segments with gaps, conserved at certain positions (Pratt).

Gibbs sampler
• http://bioweb.pasteur.fr/seqanal/interfaces/gibbs-simple.html
• Stochastic method – contains an element of chance, irreproducible.
• Needs at least 20 sequences to start with.
• Random alignment until a good solution appears.
• Good for identifying helix-turn-helix (HTH) domains and regulatory elements across a protein family.

Pratt
• http://www.ebi.ac.uk/pratt/index.html
• For motifs with different lengths.
• Allows flexible spacing between the conserved positions.
• Reports motifs as PROSITE-style patterns.

Chapter 10: Editing and publishing alignment

"It looked so good that I thought it just had to be genuine."
– Everyone's secret thought

Outline
• Reformatting
• Jalview
• Boxshade
• Logos

Background knowledge
• Attempting to insert or delete a residue in a subgroup of sequences manually can drive you crazy.

Ask yourself before saving alignment
• Do most (of your) programs support

Residues
• Charged KRDE,
