Background Knowledge Rules of Evolution
Chapter 9: Building a multiple sequence alignment

"The man with two watches never knows the time, and the man with one watch only thinks he knows."
(A man with multiple watches)

Background knowledge
• Multiple alignment is a Swiss Army knife of bioinformatics:*
  – predicting protein structure
  – predicting protein function
  – phylogenetic analysis.
• It is both art and science.
• It is easy to generate a bad alignment that looks good.
* Perl is the Swiss Army knife for bioinformatics software programmers.

[Figure: Multiple sequence alignment example]

Rules of evolution
• Important amino acids (aa) are NOT allowed to mutate.
• Less important residues change easily, sometimes randomly, sometimes to gain a new function.
• Conserved positions in an alignment are more important than non-conserved positions.

Criteria for multiple sequence alignment
• Similarity
  – structural: aa with the same role in the structure are in the same column (aligned),
  – evolutionary: aa related to the same predecessor are aligned,
  – functional: aa with the same function are aligned,
  – sequence: aa that yield the best alignment with the least degeneracy are aligned.

Applications of multiple sequence alignment
• Extrapolation
• Phylogenetic analysis
• Functional pattern identification
• Domain identification
• Identification of regulatory elements
• Structure prediction for proteins and RNA
• PCR analysis http://blocks.fhcrc.org/codehop.html

Outline
• Gathering the sequences
• ClustalW
• TCoffee
• Comparing unalignable

Gathering the sequences
• Proteins are better than DNA (shorter, more informative; exceptions: primer design, non-coding targets).
• A protein family is usually too large to collect all sequences. Start with 10, remove troublemakers, and rerun with 50 sequences.
• Discard sequences with <30% or >90% identity.
• Read the DE (description) section of each sequence entry to check for transposons.
• Discard sequences that are much shorter than the others.
• Extract (with Dotlet) sequences with repeated domains.
• Note: There are mistakes in databases.
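The identity thresholds from "Gathering the sequences" can be checked with a few lines of Python. This is only a sketch: the function names `percent_identity` and `keep_informative` are made up for illustration, the rows are assumed to be already pairwise-aligned to a reference (same length, "-" for gaps), and identity is counted over columns where both rows have a residue, which is one of several conventions in use.

```python
def percent_identity(row_a, row_b):
    """Percent identity over columns where both aligned rows have a residue."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    matches = compared = 0
    for a, b in zip(row_a.upper(), row_b.upper()):
        if a == "-" or b == "-":
            continue  # assumption: gap columns are skipped, not counted as mismatches
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def keep_informative(reference_row, candidates, low=30.0, high=90.0):
    """Return {name: identity} for candidates whose identity lies in [low, high]."""
    kept = {}
    for name, row in candidates.items():
        identity = percent_identity(reference_row, row)
        if low <= identity <= high:
            kept[name] = identity
    return kept


if __name__ == "__main__":
    reference = "ACDEFGHIKL"
    candidates = {
        "near_copy": "ACDEFGHIKL",  # adds no new information
        "useful": "ACDEFGAAAA",     # similar, but still informative
        "too_far": "AWWWWWWWWW",    # probably a troublemaker
    }
    print(keep_informative(reference, candidates))
```

The 30% and 90% defaults come straight from the slide; they are rules of thumb and can be loosened for small families.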
Mistakes in databases
• Sequencing errors
  – revised Cambridge sequence of human mtDNA (Anderson 1981) http://www.mitomap.org/mitomap/mitoseq.html
• Taxonomic misidentification
  – Amanita pantherina from Japan can also be A. ibotengutake (in mixed forests with Fagaceae and Pinaceae).
  – Russula drimeia var. queletii = Russula flavovirens = Russula queletii var. flavovirens = Russula sardonia sensu auct. mult.

Phylogenetic analysis on DNAs
• Better on proteins.
• Translate DNA to protein.
• Perform multiple alignment.
• Thread DNA back onto the protein alignment framework.

Restrict number of sequences
• Computing and building an accurate big alignment is difficult.
• Displaying is difficult: aim for printing and monitor size.
• Using is difficult: phylogenetic programs can NOT handle them.
• Included bad sequences multiply mistakes: avoid long indels and mavericks.
• Compromise between similarity and new information (30-70(90)% identity).
• Uncharacterized sequences are useful only if you want to see conserved (unmutable) regions.

Naming your sequences
• Never_use_white_spaces:
  – Amanita phalloides.
• Never use ßšpeciál symbols.
• Shorten your name to 15 characters:
  – Amanita_phalloides
• Use different names for different sequences:
  – A.ph.261004_a

Step 1: BLASTing at ExPASy server
• Avoid translated ESTs.
• http://www.expasy.ch/cgi-bin/BLASTEMBnet-CH.pl

Sequence selection
• NOT just the best ten.
• Similarity along the whole sequence, NOT just the bits.

Step 2
• Import results from BLAST to alignment programs.

Step 3
• Run multiple alignment using different methods, compare results.

Program selection
• FASTA and Swiss-Prot formats are good for saving.
• ClustalW handles more sequences.
• TCoffee is more accurate than ClustalW.

ClustalW
• http://www.ebi.ac.uk/clustalw/index.html
• Paula Hogeweg described an algorithm, Des Higgins made a program Clustal.
• Progressive method: adds sequences one by one.
• Input gaps are NOT removed: use FASTA (SeqCheck) to clean the sequence.
• Order of input sometimes influences the alignment.

ClustalW principle
• Compares sequences two by two:
  – compares A and B, makes consensus AB
  – compares C and D, makes consensus CD
  – compares AB and CD.
• Does NOT use all the information at once.
• The W in ClustalW stands for weights: every sequence receives a weight proportional to the amount of new information it contributes.

ClustalW output
• Use default, output in input sequence order.
• Reformat using fmtseq if needed http://bioweb.pasteur.fr/seqanal/interfaces/fmtseq.html.
• Results:
  – pairwise scores (ignore)
  – multiple alignment
  – guide tree: the "dendrogram" used by Clustal, in Newick format (text and parentheses), NOT a proper phylogenetic tree.

ClustalW parameters
• Change only if dissatisfied with blocks and weakly conserved positions (if the alignment is difficult to interpret)
  – substitution matrix: change BLOSUM to PAM.
  – gap-opening penalty (GOP) / gap-extension penalty (GEP): try empirically; they are automatically readjusted.

ClustalW mirrors
• Europe http://www.ebi.ac.uk/clustalw/index.html
• USA http://pir.georgetown.edu/pirwww/search/multaln.html
• Japan http://clustalw.genome.ad.jp/

TCoffee
• http://igs-server.cnrs-mrs.fr/Tcoffee/tcoffee_cgi/index.cgi
• Slower, more precise.
• Uses progressive alignment like Clustal, but it compares segments across the entire sequence set.
• First, ClustalW (global) and Lalign (local) pairwise alignments are collected into a "library".
• Second, the sequences are progressively aligned to yield the highest possible agreement with all pairwise alignments in the library.
• New option: structural alignment.
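The "library" idea behind TCoffee can be illustrated with a toy script. This is not TCoffee's implementation: the real program also re-weights and extends the library before aligning progressively. The sketch only collects residue pairs from pairwise alignments and measures how much of that library a candidate multiple alignment reproduces. All function names here are hypothetical.

```python
from collections import Counter
from itertools import combinations


def aligned_pairs(name_a, row_a, name_b, row_b):
    """Yield residue pairs (sequence name, residue index) that share a column."""
    i = j = 0
    for a, b in zip(row_a, row_b):
        if a != "-" and b != "-":
            # sort so the same pair always gets the same key
            yield tuple(sorted([(name_a, i), (name_b, j)]))
        if a != "-":
            i += 1
        if b != "-":
            j += 1


def build_library(pairwise_alignments):
    """Count how often each residue pair is aligned across pairwise alignments."""
    library = Counter()
    for name_a, row_a, name_b, row_b in pairwise_alignments:
        library.update(aligned_pairs(name_a, row_a, name_b, row_b))
    return library


def library_agreement(msa, library):
    """Fraction of the library's weight that a multiple alignment reproduces."""
    total = sum(library.values())
    if total == 0:
        return 0.0
    found = 0
    for name_a, name_b in combinations(sorted(msa), 2):
        for pair in aligned_pairs(name_a, msa[name_a], name_b, msa[name_b]):
            found += library[pair]
    return found / total


if __name__ == "__main__":
    library = build_library([
        ("s1", "ACGT", "s2", "ACGT"),
        ("s1", "ACGT", "s3", "A-GT"),
        ("s2", "ACGT", "s3", "A-GT"),
    ])
    good = {"s1": "ACGT", "s2": "ACGT", "s3": "A-GT"}
    bad = {"s1": "ACGT", "s2": "ACGT", "s3": "AG-T"}
    print(library_agreement(good, library), library_agreement(bad, library))
```

A multiple alignment that agrees with every pairwise alignment scores 1.0; misplacing the gap in s3 lowers the score, which is the quantity TCoffee tries to maximize.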
[Figure: TCoffee choices]

3-D Coffee
• Appropriate method used for each pair

TCoffee output
• aln: text file alignment, can be used as input for other programs;
• html: colour coded, red indicates high quality, blue low;
• pdf: html converted;
• dnd: dendrogram used by TCoffee, in default Newick format;
• ph: true phylogenetic tree in Newick format (generated by neighbour-joining method);
• png: picture of ph.

Evaluating quality of alignment with TCoffee
• Align by other program.
• Evaluate with TCoffee.
• There are NO E-values for multiple alignments.
• Yellow, orange, and red mean an 80% chance of being correctly aligned.

Looking at protein alignment
*  entirely conserved column
:  roughly the same size and hydropathy of aa
.  the size or hydropathy preserved in the course of evolution
• A good block of 20 aa has 3 stars (*), 6 colons (:), and a few periods (.).

…looking at protein alignment
• Conserved columns in a multiple alignment are meaningful only when the surroundings are NOT conserved.

Patterns of conservation
• Tryptophan: large hydrophobic aa, deep in the core, difficult to mutate. If it mutates, then to another aromatic (phenylalanine or tyrosine). WYF
• Glycine or proline are on the extremities of beta-strands or alpha helices. GP
• Cysteines form disulphide bridges; their distance is a signature of a domain. CC
• Histidine and serine in catalytic sites (of proteases). HS

…patterns of conservation
• Charged aa lysine, arginine, aspartic acid, and glutamic acid are involved in ligand binding or in a salt bridge inside the core. KRDE
• Leucine is conserved only in protein-protein interactions like the leucine zipper. L

Refinement of alignment
• Adding distantly related sequences should enhance existing patterns rather than destroy blocks.
• Regions with indels are candidates for loops (flexible part of structure, acting as connector).
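The *, :, and . symbols under a protein alignment can be recomputed from the columns. The sketch below assumes the "strong" and "weak" residue groups as listed in the Clustal documentation, quoted here from memory and so worth checking against your program's manual. The function names are invented for this example.

```python
# Assumed Clustal residue groups: ":" if a column fits one strong group,
# "." if it fits one weak group.
STRONG_GROUPS = ["STA", "NEQK", "NHQK", "NDEQ", "QHRK", "MILV", "MILF", "HY", "FYW"]
WEAK_GROUPS = ["CSA", "ATV", "SAG", "STNK", "STPA", "SGND",
               "SNDEQK", "NDEQHK", "NEQHRK", "FVLIM", "HFY"]


def column_symbol(column):
    """Return '*', ':', '.' or ' ' for one alignment column."""
    residues = set(column.upper())
    if "-" in residues:
        return " "  # columns with gaps get no symbol
    if len(residues) == 1:
        return "*"  # entirely conserved column
    if any(residues <= set(group) for group in STRONG_GROUPS):
        return ":"
    if any(residues <= set(group) for group in WEAK_GROUPS):
        return "."
    return " "


def conservation_line(rows):
    """Build the symbol line for a list of equally long aligned rows."""
    return "".join(column_symbol("".join(column)) for column in zip(*rows))


if __name__ == "__main__":
    rows = ["WKSS", "WRTA", "WKAG"]
    for row in rows:
        print(row)
    print(conservation_line(rows))
```

Counting the symbols in a 20-residue window of this line gives a quick check of the "3 stars, 6 colons, a few periods" rule of thumb.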
[Figure: Before adding a new sequence]

[Figure: After (aln format)]

[Figure: DCA]

Other multiple alignment methods
• Dialign: builds alignments by comparing whole segments of the sequences, NOT residues. No gap penalty is used. For globally unrelated sequences with local similarities (genomic DNA and protein families). http://dialign.gobics.de/chaos-dialign-submission
• DCA (Divide-and-Conquer Alignment): heuristic approach to sum-of-pairs (SP) optimal alignment. http://bibiserv.techfak.uni-bielefeld.de/dca/submission.html

When NOT to use multiple sequence alignment: assembly
• Assembling sequence pieces in a genome sequencing project (use Phrap instead). http://www.phrap.org/

When NOT to use multiple sequence alignment: comparing unalignable
• Sequences without a common ancestor, sequence without a homologue.
• Look simultaneously at all sequences for short conserved gap-free segments (Gibbs sampler).
• Look simultaneously at all sequences for flexible patterns: segments with gaps, conserved at certain positions (Pratt).

Gibbs sampler
• http://bioweb.pasteur.fr/seqanal/interfaces/gibbs-simple.html
• Stochastic method: contains an element of chance, irreproducible.
• Needs at least 20 sequences to start with.
• Random alignment until a good solution appears.
• Good for identifying helix-turn-helix (HTH) domains and regulatory elements across a protein family.

Pratt
• http://www.ebi.ac.uk/pratt/index.html
• For motifs with different lengths.
• Allows flexible spacing between the conserved positions.
• Reports the motifs it finds as PROSITE patterns.

Chapter 10: Editing and publishing alignment

"It looked so good that I thought it just had to be genuine."
(Everyone's secret thought)

Background knowledge
• Attempting to insert or delete a residue in a subgroup of sequences manually can drive you crazy.

Outline
• Reformatting
• Jalview
• Boxshade
• Logos

Residues
• Charged KRDE,

Ask yourself before saving alignment
• Do most (of your) programs support