Chapter 9: Building a multiple sequence alignment

"The man with two watches never knows the time, and the man with one watch only thinks he knows."
__A man with multiple watches

Background knowledge
• Multiple alignment is a Swiss Army knife of bioinformatics:*
  – predicting structure
  – predicting protein function
  – phylogenetic analysis.
• It is both art and science.
• It is easy to generate a bad alignment that looks good.
*Perl is the Swiss Army knife for bioinformatics software programmers.

Multiple sequence alignment example

Rules of evolution
• Important aa are NOT allowed to mutate.
• Less important residues change easily, sometimes randomly, sometimes to gain a new function.
• Conserved positions in alignment are more important than non-conserved positions.

Criteria for multiple sequence alignment
• Similarity
  – structural: aa with the same role in structure are in the same column (aligned),
  – evolutionary: aa related to the same predecessor are aligned,
  – functional: aa with the same function are aligned,
  – sequence: aa that yield the best alignment with the least degeneracy are aligned.

Applications of multiple sequence alignment
• Extrapolation
• Phylogenetic analysis
• Functional pattern identification
• Domain identification
• Identification of regulatory elements
• Structure prediction for proteins and RNA
• PCR analysis http://blocks.fhcrc.org/codehop.html

Outline
• Gathering the sequences
• ClustalW
• TCoffee
• Comparing unalignable

Gathering the sequences
• Proteins are better than DNA (shorter, more informative; exceptions: primer design, non-coding target).
• A protein family is usually too large to collect all sequences. Start with 10, remove troublemakers, and rerun with 50 sequences.
• Discard sequences with <30% and >90% identity.
• Read the DE (description) section of each sequence entry to check for transposons.
• Discard sequences shorter than the others.
• Extract (with Dotlet) sequences with repeated domains.
• Note: There are mistakes in databases.
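
The identity rule above can be sketched in a few lines. This is only an illustration: the hit list is made up, and it assumes each candidate already has a percent identity to the query (for example, taken from a BLAST report).

```python
def filter_by_identity(hits, low=30.0, high=90.0):
    """Keep (name, percent_identity) pairs with low <= identity <= high."""
    return [(name, pid) for name, pid in hits if low <= pid <= high]

# Hypothetical hits: (sequence name, % identity to the query).
hits = [("seqA", 95.2), ("seqB", 62.0), ("seqC", 28.4), ("seqD", 41.7)]
kept = filter_by_identity(hits)
print(kept)  # [('seqB', 62.0), ('seqD', 41.7)]
```

The thresholds are parameters, so the stricter 30-70% compromise can be tried by passing `high=70.0`.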

Mistakes in databases
• Sequencing errors
  – revised Cambridge sequence of human mtDNA (Anderson 1981) http://www.mitomap.org/mitomap/mitoseq.html
• Taxonomic misidentification
  – Amanita pantherina from Japan can also be A. ibotengutake (in mixed forests with Fagaceae and Pinaceae).
  – Russula drimeia var. queletii = Russula flavovirens = Russula queletii var. flavovirens = Russula sardonia sensu auct. mult.

Phylogenetic analysis on DNAs
• Better on proteins.
• Translate DNA to protein.
• Perform multiple alignment.
• Thread DNA back onto protein alignment framework.
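
Threading DNA back onto a protein alignment means that every aligned residue is replaced by its codon and every gap by three gap characters. A minimal sketch, assuming the DNA is the ungapped, in-frame coding sequence of that protein row (the function name `thread_dna` is made up for this example):

```python
def thread_dna(aligned_protein, dna, gap="-"):
    """Map an ungapped coding DNA sequence onto a gapped protein row."""
    n_residues = sum(1 for aa in aligned_protein if aa != gap)
    if len(dna) < 3 * n_residues:
        raise ValueError("DNA is too short for the protein row")
    codons = []
    pos = 0
    for aa in aligned_protein:
        if aa == gap:
            codons.append(gap * 3)         # one protein gap = three DNA gaps
        else:
            codons.append(dna[pos:pos + 3])  # next codon of the coding sequence
            pos += 3
    return "".join(codons)

print(thread_dna("M-KV", "ATGAAAGTT"))  # ATG---AAAGTT
```

The sketch does not check that each codon really translates to the residue above it; that check is worth adding for real data.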

Restrict number of sequences
• Computing and building an accurate big alignment is difficult.
• Displaying is difficult – aim for printing and monitor size.
• Using is difficult – phylogenetic programs can NOT handle them.
• Included bad sequences multiply mistakes – avoid long indels and mavericks.
• Compromise between similarity and new information (30-70(90)% identity).
• Uncharacterized sequences are useful only if you want to see conserved (unmutable) regions.

Naming your sequences
• Never_use_white_spaces:
  – Amanita phalloides.
• Never use ßšpeciál symbols.
• Shorten your name to 15 characters:
  – Amanita_phalloides
• Use different names for different sequences:
  – A.ph.261004_a
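
The naming rules above can be applied automatically. The following is a rough sketch, not a standard tool: the function name `safe_names` and the way duplicates get a numeric suffix are assumptions made for this example.

```python
import re
import unicodedata

def safe_names(names, max_len=15):
    """Return unique ASCII names without white space, at most max_len long."""
    used = set()
    result = []
    for name in names:
        # Decompose accented letters, then drop everything that is not ASCII.
        cleaned = (unicodedata.normalize("NFKD", name)
                   .encode("ascii", "ignore").decode("ascii"))
        cleaned = re.sub(r"\s+", "_", cleaned.strip())      # no white space
        cleaned = re.sub(r"[^A-Za-z0-9_.]", "", cleaned)    # no special symbols
        base = cleaned[:max_len] or "seq"
        candidate = base
        n = 1
        while candidate in used:                            # keep names different
            suffix = "_%d" % n
            candidate = base[:max_len - len(suffix)] + suffix
            n += 1
        used.add(candidate)
        result.append(candidate)
    return result

print(safe_names(["Amanita phalloides", "Amanita phalloides", "ßšpeciál"]))
# ['Amanita_phalloi', 'Amanita_phall_1', 'special']
```

Truncation loses information, so keep a table that maps the short names back to the original ones.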

Step 1
BLASTing at ExPaSy server – avoid translated EST
http://www.expasy.ch/cgi-bin/BLASTEMBnet-CH.pl

Sequence selection
• NOT just the best ten

Sequence selection
• Similarity along the whole sequence – NOT just the bits.

Step 2
Import results from BLAST to alignment programs.

Step 3
Run multiple alignment using different methods, compare results.

Program selection
• FASTA and Swiss-Prot are good for saving.
• ClustalW handles more sequences.
• TCoffee is more accurate than ClustalW.

Outline
• Gathering the sequences
• ClustalW
• TCoffee
• Comparing unalignable

ClustalW
• http://www.ebi.ac.uk/clustalw/index.html
• Paula Hogeweg described an algorithm, Des Higgins made a program Clustal.
• Progressive method – adding sequences one by one.
• Input gaps are NOT removed – use FASTA (SeqCheck) to clean the sequence.
• Order of input sometimes influences alignment.

ClustalW principle
• Compares sequences two by two:
  – compares A and B, makes consensus AB
  – compares C and D, makes consensus CD
  – compares AB and CD.
• Does NOT use all the information at once.
• Clustal W version – Clustal weights, every sequence receives a weight proportional to the amount of new info it contributes.

ClustalW output
• Use default, output in input sequence order.
• Reformat using fmtseq if needed http://bioweb.pasteur.fr/seqanal/interfaces/fmtseq.html.
• Results:
  – pairwise scores (ignore)
  – multiple alignment
  – guide tree – „dendrogram“ used by Clustal in Newick format (text and parentheses), NOT a proper phylogenetic tree.
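
The two-by-two principle can be made concrete by walking a guide tree and listing the merges it implies. This sketch only shows the order of the merges, not the alignment itself; the nested-tuple tree and the function name `merge_order` are assumptions for illustration.

```python
def merge_order(tree):
    """Return the merges implied by a guide tree given as nested tuples."""
    steps = []

    def visit(node):
        if isinstance(node, str):        # a leaf: one input sequence
            return node
        left, right = node
        left_label = visit(left)         # finish the left subtree first
        right_label = visit(right)       # then the right subtree
        merged = left_label + right_label
        steps.append((left_label, right_label, merged))
        return merged

    visit(tree)
    return steps

guide_tree = (("A", "B"), ("C", "D"))
for left, right, merged in merge_order(guide_tree):
    print(left, "+", right, "->", merged)
# A + B -> AB
# C + D -> CD
# AB + CD -> ABCD
```

Each merge sees only its two inputs, which is why a progressive method does not use all the information at once.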

ClustalW parameters
• Change only if dissatisfied with blocks and weakly conserved positions (if alignment is difficult to interpret)
  – substitution matrix – change BLOSUM to PAM.
  – gap-opening penalty (GOP) / gap-extension penalty (GEP) – try empirically, automatically readjusted.

ClustalW mirrors
• Europe http://www.ebi.ac.uk/clustalw/index.html
• USA http://pir.georgetown.edu/pirwww/search/multaln.html
• Japan http://clustalw.genome.ad.jp/

Outline
• Gathering the sequences
• ClustalW
• TCoffee
• Comparing unalignable

TCoffee
• http://igs-server.cnrs-mrs.fr/Tcoffee/tcoffee_cgi/index.cgi
• Slower, more precise.
• Uses progressive alignment like Clustal but it compares segments across the entire sequence set.
• First, ClustalW (and Lalign) make a collection of local and global alignments – „library“.
• Second, the sequences are progressively aligned to yield the highest possible agreement with all pairwise alignments in the library.
• New option – structural alignment.

TCoffee choices

3-D Coffee
Appropriate method used for each pair

TCoffee output
• aln – text file alignment, can be used as input for other programs;
• html – colour coded, red indicates high quality, blue low;
• pdf – html converted;
• dnd – dendrogram used by TCoffee, in default Newick format;
• ph – true phylogenetic tree in Newick format (generated by neighbour-joining method);
• png – picture of ph.

Evaluating quality of alignment with TCoffee
• Align with another program.
• Evaluate with TCoffee.
• There are NO E-values for multiple alignments.
• Yellow, orange, and red mean an 80% chance of being correctly aligned.

Looking at protein alignment
* entirely conserved column
: roughly the same size and hydropathy of aa
. the size or hydropathy preserved in the course of evolution
A good block of 20 aa has 3 stars (*), 6 colons (:), and a few periods (.).
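
The star line can be reproduced for a small alignment. The sketch below marks only entirely conserved columns with `*`; the `:` and `.` symbols depend on the alignment program's residue-group tables, which are not reproduced here. The function name `star_line` and the three rows are made up for this example.

```python
def star_line(rows, gap="-"):
    """Return a line with '*' under columns where all residues are identical."""
    if len(set(len(r) for r in rows)) != 1:
        raise ValueError("all rows must have the same length")
    marks = []
    for column in zip(*rows):
        residues = set(column)
        conserved = len(residues) == 1 and gap not in residues
        marks.append("*" if conserved else " ")
    return "".join(marks)

rows = ["MKWVC", "MRWVC", "MK-IC"]
line = star_line(rows)
for row in rows:
    print(row)
print(line)                                   # '*   *'
print(line.count("*"), "entirely conserved columns")
```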

…looking at protein alignment
• Conserved columns in a multiple alignment are meaningful only when surroundings are NOT conserved.

Patterns of conservation
• Tryptophan – large hydrophobic aa, deep in the core, difficult to mutate. If it mutates, then to another aromatic aa (phenylalanine or tyrosine). WYF
• Glycine or proline are on extremities of beta-strands or alpha helices. GP
• Cysteines form disulphide bridges, distance is a signature of domain. CC
• Histidine and serine in catalytic sites (of proteases). HS

…patterns of conservation
• Charged aa lysine, arginine, aspartic acid, and glutamic acid are involved in ligand binding or in a salt bridge inside the core. KRDE
• Leucine conserved only in protein-protein interaction like leucine zipper. L

Refinement of alignment
• Adding distantly related sequences should enhance existing patterns rather than destroy blocks.
• Regions with indels – candidates for loops (flexible part of structure, acting as connector).

Before adding a new sequence

After (aln format)

Other multiple alignment methods
• Dialign – alignments by comparing whole segments of the sequences, NOT residues. No gap penalty is used. For globally unrelated sequences with local similarities (genomic DNA and protein families). http://dialign.gobics.de/chaos-dialign-submission
• DCA Divide-and-Conquer Alignment – heuristic approach to sum-of-pairs (SP) optimal alignment. http://bibiserv.techfak.uni-bielefeld.de/dca/submission.html

DCA

When NOT to use multiple sequence alignment: assembly
• Assembling sequence pieces in a genome sequencing project (use Phrap instead). http://www.phrap.org/

Outline
• Gathering the sequences
• ClustalW
• TCoffee
• Comparing unalignable

When NOT to use multiple sequence alignment: comparing unalignable
• Sequences without common ancestor, sequence without homologue.
• Look simultaneously at all sequences for short conserved gap-free segments (Gibbs sampler).
• Look simultaneously at all sequences for flexible patterns – segments with gaps, conserved at certain positions (Pratt).

Gibbs sampler
• http://bioweb.pasteur.fr/seqanal/interfaces/gibbs-simple.html
• Stochastic method – contains an element of chance, irreproducible.
• Needs at least 20 sequences to start with.
• Random alignment until good solution appears.
• Good for identifying helix-turn-helix (HTH) domains and regulatory elements across a protein family.
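
Flexible patterns of the kind Pratt reports are written in PROSITE syntax, which can be searched with ordinary regular expressions after a small translation. The converter below is a simplified sketch that covers only the common elements (`x`, `[..]`, `{..}`, repeat counts, and `<` / `>` anchors); the function name, the example pattern, and the test sequence are illustrative assumptions.

```python
import re

def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = "^" if element.startswith("<") else ""
        suffix = "$" if element.endswith(">") else ""
        element = element.strip("<>")
        # Split off an optional repeat count such as (3) or (2,4).
        match = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        if match is None:
            raise ValueError("empty pattern element")
        core, low, high = match.groups()
        if core == "x":                      # any residue
            regex = "."
        elif core.startswith("{"):           # {..} lists forbidden residues
            regex = "[^" + core[1:-1] + "]"
        else:                                # one residue or an allowed [..] set
            regex = core
        if low:
            regex += "{%s,%s}" % (low, high) if high else "{%s}" % low
        parts.append(prefix + regex + suffix)
    return "".join(parts)

regex = prosite_to_regex("C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H.")
print(regex)  # C.{2,4}C.{3}[LIVMFYWC].{8}H.{3,5}H
print(bool(re.search(regex, "MSCPECGKAFSRKDHLRRHIRTHTG")))  # True
```

The variable spacers such as `x(2,4)` are what makes the pattern flexible: the conserved positions stay fixed while the distance between them may vary.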

Pratt
• http://www.ebi.ac.uk/pratt/index.html
• For motifs with different lengths.
• Allows flexible spacing between the conserved positions.
• Using PROSITE pattern-finding motif.

Chapter 10: Editing and publishing alignment

"It looked so good that I thought it just had to be genuine."
__Everyone's secret thought.

Outline
• Reformatting
• Jalview
• Boxshade
• Logos

Background knowledge
• An attempt to insert or delete a residue in a subgroup of sequences manually can drive you crazy.

Ask yourself before saving alignment
• Do most (of your) programs support this format?
• What about your collaborators?
• Is all needed info included?
• Is it easy to manipulate?
• Stick to one format.

Residues
• Charged KRDE,
• polar NQST,
• aliphatic ILMV,
• aromatic FYW,
• others APCGH.
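
The residue classes listed above (charged, polar, aliphatic, aromatic, others) translate directly into a lookup table, which is handy when judging whether a column mixes classes. A small sketch; the names `RESIDUE_CLASSES`, `CLASS_OF`, and `column_classes` are made up for this example.

```python
# Residue classes exactly as grouped on the slide.
RESIDUE_CLASSES = {
    "charged": "KRDE",
    "polar": "NQST",
    "aliphatic": "ILMV",
    "aromatic": "FYW",
    "others": "APCGH",
}
# Reverse lookup: one-letter code -> class name.
CLASS_OF = {aa: name for name, letters in RESIDUE_CLASSES.items() for aa in letters}

def column_classes(column):
    """Return the set of residue classes seen in one alignment column."""
    return {CLASS_OF[aa] for aa in column.upper() if aa in CLASS_OF}

print(column_classes("KRKE"))          # {'charged'}
print(sorted(column_classes("FYL-")))  # ['aliphatic', 'aromatic']
```

Gaps and unknown symbols are simply ignored by the lookup.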

Outline
• Reformatting
• Jalview
• Boxshade
• Logos

Formats
• Arrangement of characters, symbols and keywords that specify what sequence, ID name, comments, etc. look like in the sequence entry and where in the entry the program should look to find them.
• There are never any hidden, unprintable „control“ characters in any sequence format.
• MS Word is NOT a sequence format, simple ASCII is (pico text editor).
• Interleaved – chopped to blocks of 60 residues.

Alignment in FASTA

Frequent formats
• fasta – easy to manipulate by machines, NOT interleaved
• pir – fasta with annotation line
• msf – GCG package, multiple sequence format
• selex – extended msf
• aln – Clustal W format, simplified msf
• phylip – aln variant for phylogenetic analysis
• post-script, pdf, html – graphic for printing; terminal format
• xml – extensible markup language, future standard.
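
The difference between a sequential format such as FASTA and an interleaved layout is easy to see in code. The sketch below reads FASTA text and prints the same alignment in interleaved blocks; it is a simplified illustration and does not write valid aln or msf files. Function names and the sample records are assumptions.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (name, sequence) pairs."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The name is the first word of the header line.
            name = (line[1:].split() or ["unnamed"])[0]
            records.append([name, ""])
        elif records:
            records[-1][1] += line
    return [(name, seq) for name, seq in records]

def interleave(records, width=60):
    """Format aligned records as interleaved blocks of `width` columns."""
    length = len(records[0][1])
    pad = max(len(name) for name, _ in records) + 2
    blocks = []
    for start in range(0, length, width):
        rows = [name.ljust(pad) + seq[start:start + width] for name, seq in records]
        blocks.append("\n".join(rows))
    return "\n\n".join(blocks)

fasta = ">a first sequence\nMK-V\nAA\n>b\nMKLVAA\n"
records = parse_fasta(fasta)
print(interleave(records, width=4))
```

With the default `width=60` the blocks match the 60-residue chunks mentioned for interleaved formats.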

Alignment in MSF

Alignment in SELEX

Alignment in ALN

Alignment in PHYLIP

Conversion
• Shorten name of sequence to 15 characters, otherwise name is included in sequence.
• FMTSEQ – converts most formats. http://bioweb.pasteur.fr/seqanal/interfaces/fmtseq.html
• http://www.ii.uib.no/~matthewb/tools/align_convert_in.cgi
• SEQCHECK – cleans FASTA sequence. http://darwin.nmsu.edu/bioinfo/seqcheck/seqcheck.php
• Some info can be lost, keep a back-up copy of your sequence.

Possibly lost information
• Sequence name (shortened, included in sequence),
• upper/lowercase,
• gap type (., -, ~ turned to -),
• special aa and nucleotides (X and N).
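
The gap-type loss mentioned above is easy to demonstrate: once the different gap symbols are mapped to a single character, the original distinction cannot be recovered. A minimal sketch (the function name `normalise_gaps` is made up for this example):

```python
def normalise_gaps(sequence, gap_chars=".~", gap="-"):
    """Replace every alternative gap symbol with a single gap character."""
    table = str.maketrans({ch: gap for ch in gap_chars})
    return sequence.translate(table)

original = "MK..V~~L-A"
converted = normalise_gaps(original)
print(converted)  # MK--V--L-A
```

This is one reason to keep a back-up copy of the sequence before converting.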

Outline
• Reformatting
• Jalview
• Boxshade
• Logos

JalView
• http://www.jalview.org/
• Java applet: runs on your computer, your secret sequence does NOT travel through Internet if you set File-Work Offline

Jalview output
Graph showing a level of conservation

Colour scheme
• http://www.es.embnet.org/Doc/jalview/contents.html#colour
• ClustalX colours
• Zappo colours
• Taylor colours
• Hydrophobicity colours
• Colouring residues above a percentage identity threshold
• PID (Percentage Identity) Colours
• BLOSUM62 colours
• User defined colour schemes.

Editing a group of sequences

Defining a group
• To modify alignment collectively, define a group (mouse click while CTRL).

Editing a group of sequences

Group editing mode

Steps
• Click anywhere on a sequence
• Drag to the left to insert gaps
• Drag to the right to remove gaps
• Save intermediate results, there is NOT any undo button.

Useful features of JalView
• Calculate-Autocalculate consensus: graph below alignment is updated.
• Calculate-Remove redundancy: removes x% identical sequences.
• Calculate-Neighbour joining tree using PID: computes and displays phylogenetic tree, ready for group editing.
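
Removing x% identical sequences can be sketched as a greedy filter on aligned rows. This is not Jalview's implementation, only an illustration of the idea; percent identity is defined here over columns where both rows have a residue, which is one of several possible definitions.

```python
def percent_identity(a, b, gap="-"):
    """Percent identity over columns where both aligned rows have a residue."""
    pairs = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    if not pairs:
        return 0.0
    same = sum(1 for x, y in pairs if x == y)
    return 100.0 * same / len(pairs)

def remove_redundancy(records, threshold=90.0):
    """Keep a record only if it is below the threshold to every kept record."""
    kept = []
    for name, seq in records:
        if all(percent_identity(seq, other) < threshold for _, other in kept):
            kept.append((name, seq))
    return kept

# Hypothetical aligned rows.
records = [("a", "MKVLA"), ("b", "MKVLA"), ("c", "MRTLG")]
print(remove_redundancy(records))  # [('a', 'MKVLA'), ('c', 'MRTLG')]
```

The result depends on input order, because the first sequence of a redundant group is the one that is kept.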

Saving output of JalView
• File-Output alignment via text box
• Select format, Apply
• Open MS Word
• CTRL-C and CTRL-V
• Save Word document as .doc or .txt.

Other options
• BioEdit
• QAlign
• CLC Workbench

Outline
• Reformatting
• Jalview
• Boxshade
• Logos

Boxshade
• http://www.ch.embnet.org/software/BOX_form.html
• Shades columns according to level of conservation
• Black – identical
• Grey – similar

Boxshade input

Boxshade intermediate output

Boxshade output

Outline
• Reformatting
• Jalview
• Boxshade
• Logos

Logos

Logos
• Position corresponds to a column in alignment.
• Total height depends on the level of conservation.
• Size of each letter in a logo position depends on frequency of letter in the column.
• Top letter is the most frequent in the column.
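
The heights in a logo can be computed per column. The sketch below uses the usual information-content measure (maximum entropy minus observed entropy, in bits) and leaves out the small-sample correction that logo programs normally apply; the function name `logo_column` and the 20-letter protein alphabet are assumptions.

```python
import math
from collections import Counter

def logo_column(column, alphabet_size=20, gap="-"):
    """Return (total_height_bits, {letter: height}) for one alignment column."""
    residues = [aa for aa in column if aa != gap]
    if not residues:
        return 0.0, {}
    counts = Counter(residues)
    total = len(residues)
    freqs = {aa: n / total for aa, n in counts.items()}
    # Shannon entropy of the column; 0 for a fully conserved column.
    entropy = -sum(f * math.log2(f) for f in freqs.values())
    height = math.log2(alphabet_size) - entropy
    # Each letter gets a share of the total height equal to its frequency.
    letters = {aa: f * height for aa, f in freqs.items()}
    return height, letters

height, letters = logo_column("WWFF")
print(round(height, 3))  # 3.322
# Most frequent (tallest) letter first, as drawn at the top of the stack.
print(sorted(letters.items(), key=lambda item: -item[1]))
```

A fully conserved protein column reaches the maximum of log2(20), about 4.32 bits, while a column with two equally frequent residues loses one bit.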

Logos
• Tom Schneider http://www.bio.cam.ac.uk/cgi-bin/seqlogo/logo.cgi
• Jan Gorodkin http://www.cbs.dtu.dk/~gorodkin/appl/plogo.html
• FASTA format, delete the name: >name xxxx-xxxx to >xxxx-xxxx

Extracting info from multiple sequence alignment
• Blocks – identifies blocks
• Blockgap – measures blocks
• Lama – compares alignment with the BLOCKs database
• Amas – identifies features in alignment

Thank you for your attention!
