Deriving Amino Acid Exchange Matrices (II) and Multiple Sequence Alignment (I)
Introduction to bioinformatics, lecture 8
Deriving amino acid exchange matrices (II) and multiple sequence alignment (I)

Summary: Dayhoff's PAM matrices
Derived from global alignments of closely related sequences.
Matrices for greater evolutionary distances are extrapolated from those for lesser ones.
The number with the matrix (PAM40, PAM100) refers to the evolutionary distance; greater numbers are greater distances.
Several later groups have attempted to extend Dayhoff's methodology or re-apply her analysis using later databases with more examples.

Extensions of Dayhoff's methodology:
> Jones, Thornton and coworkers used the same methodology as Dayhoff but with modern databases (CABIOS 8:275).
> Gonnet and coworkers (Science 256:1443) used a slightly different (but theoretically equivalent) methodology.
> Henikoff & Henikoff (Proteins 17:49) compared these two newer versions of the PAM matrices with Dayhoff's originals.

The BLOSUM matrices (BLOcks SUbstitution Matrix)
The BLOSUM series of matrices was created by Steve Henikoff and colleagues (PNAS 89:10915).
Derived from local, un-gapped alignments of distantly related sequences.
All matrices are directly calculated; no extrapolations are used.
As with the PAM matrices, the observed frequency of each residue pair is compared to its expected frequency (essentially the product of the frequencies of the two residues in the dataset). The logarithm of this ratio gives the log-odds matrix.

The Blocks Database
The Blocks Database contains multiple alignments of conserved regions in protein families.
Blocks are multiply aligned un-gapped segments corresponding to the most highly conserved regions of proteins.
The blocks for the BLOCKS database are made automatically by looking for the most highly conserved regions in groups of proteins represented in the PROSITE database.
These blocks are then calibrated against the SWISS-PROT database to obtain a measure of the random distribution of matches. It is these calibrated blocks that make up the BLOCKS database.
The database can be searched by e-mail and World Wide Web (WWW) servers (http://blocks.fhcrc.org/help) to classify protein and nucleotide sequences.

The Blocks Database
[Figure: gapless alignment blocks]

The BLOSUM series
BLOSUM30, 35, 40, 45, 50, 55, 60, 62, 65, 70, 75, 80, 85, 90.
The number after the matrix (BLOSUM62) refers to the percent identity threshold applied to the blocks (in the BLOCKS database) used to construct the matrix: for BLOSUM62, sequences within a block that are >=62% identical are clustered and counted as a single sequence.
No extrapolations are made in going to higher evolutionary distances.
> High number: closely related sequences
> Low number: distant sequences
BLOSUM62 is the most popular: best for general alignment.

The log-odds matrix for BLOSUM62
[Figure: the BLOSUM62 log-odds matrix]

PAM versus BLOSUM
PAM:
> Based on an explicit evolutionary model
> Derived from small, closely related proteins with ~15% divergence
> Higher PAM numbers to detect more remote sequence similarities
> Errors in PAM 1 are scaled 250X in PAM 250
BLOSUM:
> Based on empirical frequencies
> Uses a much larger, more diverse set of protein sequences (30-90% ID)
> Lower BLOSUM numbers to detect more remote sequence similarities
> Errors in BLOSUM arise from errors in alignment

Comparing exchange matrices
To compare amino acid exchange matrices, the "Entropy" value can be used. This is a relative entropy value (H) which describes the amount of information available per aligned residue pair.

Specialized matrices
Claverie (J.Mol.Biol 234:1140) developed a set of substitution matrices designed explicitly for finding possible frameshifts in protein sequences.
These matrices are designed solely for use in protein-protein comparisons; they should not be used with programs which blindly translate DNA (e.g. BLASTX, TBLASTN).
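The log-odds construction and the relative entropy H mentioned above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure used to build the published matrices: the input is a hypothetical list of un-gapped alignment columns, no sequence clustering or weighting is applied, scores are left in bits without scaling or rounding, and the function names `pair_counts` and `log_odds` are made up for this example.

```python
import math
from collections import Counter
from itertools import combinations


def pair_counts(columns):
    """Count unordered residue pairs observed within each aligned column."""
    counts = Counter()
    for col in columns:
        for a, b in combinations(col, 2):
            counts[tuple(sorted((a, b)))] += 1
    return counts


def log_odds(columns):
    """Return (scores, H): log2 odds per residue pair and relative entropy in bits.

    Assumption: toy un-gapped columns, every sequence counted equally.
    """
    counts = pair_counts(columns)
    total = sum(counts.values())
    # Observed pair frequencies q_ab (unordered pairs, a <= b).
    q = {pair: c / total for pair, c in counts.items()}

    # Residue frequencies implied by the pairs: p_a = q_aa + sum_{b != a} q_ab / 2.
    p = Counter()
    for (a, b), f in q.items():
        if a == b:
            p[a] += f
        else:
            p[a] += f / 2
            p[b] += f / 2

    scores = {}
    for (a, b), f in q.items():
        # Expected frequency under independence; the factor 2 accounts for
        # the two orderings of an unordered pair of different residues.
        expected = p[a] * p[b] if a == b else 2 * p[a] * p[b]
        scores[(a, b)] = math.log2(f / expected)

    # Relative entropy: average information (bits) per aligned residue pair.
    entropy = sum(f * scores[pair] for pair, f in q.items())
    return scores, entropy


if __name__ == "__main__":
    toy_columns = ["AA", "AA", "SS", "AS"]  # hypothetical data
    scores, h = log_odds(toy_columns)
    for pair, s in sorted(scores.items()):
        print(pair, round(s, 3))
    print("H =", round(h, 3))
```

In this toy input, identical pairs are seen more often than chance predicts and score positive, while the A/S pair is seen less often and scores negative. A matrix with larger H carries more information per aligned pair and therefore suits more closely related sequences.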
Specialized matrices
Rather than starting from alignments generated by sequence comparison, Risler et al (1988) and later Overington et al (1992) only considered proteins for which an experimentally determined three-dimensional structure was available.
They then aligned similar proteins on the basis of their structure rather than sequence and used the resulting sequence alignments as their database from which to gather substitution statistics. In principle, the Risler or Overington matrices should give more reliable results than either PAM or BLOSUM. However, the comparatively small number of available protein structures (particularly in the Risler et al study) limited the reliability of their statistics.
Overington et al (1992) developed further matrices that consider the local environment of the amino acids.

A note on reliability
All these matrices are designed using standard evolutionary models.
It is important to understand that evolution is not the same for all proteins, nor even for all regions of the same protein.
No single matrix performs best on all sequences. Some are better for sequences with few gaps, and others are better for sequences with fewer identical amino acids.
Therefore, when aligning sequences, applying a general model to all cases is not ideal. Rather, re-adjustment can be used to make the general model better fit the given data.

Pair-wise alignment quality versus sequence identity
(Vogt et al., JMB 249, 816-831, 1995)

Summary
If an ORF exists, align at the protein level.
Amino acid substitution matrices reflect the log-odds ratio between the evolutionary and random model and can therefore help in determining homology via the alignment score.
The evolutionary and random models depend on the generalized data used to derive them. This is not an ideal solution.
Apart from the PAM and BLOSUM series, a great number of further matrices have been developed.
Matrices have been made based on DNA, protein structure, information content, etc.
For local alignment, BLOSUM62 is often superior; for distant (global) alignments, BLOSUM50, GONNET, or (still) PAM250 work well.
Remember that gap penalties are always a problem; unlike the matrices themselves, there is no formal way to calculate their values. You can follow recommended settings, but these are based on trial and error and not on a formal framework.

Biological definitions for related sequences
Homologues are similar sequences in two different organisms that have been derived from a common ancestor sequence. Homologues can be described as either orthologues or paralogues.
Orthologues are similar sequences in two different organisms that have arisen due to a speciation event. Orthologues typically retain identical or similar functionality throughout evolution.
Paralogues are similar sequences within a single organism that have arisen due to a gene duplication event.
Xenologues are similar sequences that do not share the same evolutionary origin, but rather have arisen out of horizontal transfer events through symbiosis, viruses, etc.

So this means …
Source: http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/Orthology.html

Multiple sequence alignment
Sequences can be conserved across species and perform similar or identical functions. Their alignments:
> hold information about which regions have high mutation rates over evolutionary time and which are evolutionarily conserved;
> allow identification of regions or domains that are critical to functionality.
Sequences can be mutated or rearranged to perform an altered function. Their alignments show:
> which changes in the sequences have caused a change in the functionality.
Multiple sequence alignment: the idea is to take three or more sequences and align them so that the greatest number of similar characters are aligned in the same column of the alignment.
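The aim stated above, putting as many similar characters as possible in the same column, can be made concrete with a small scoring sketch. It assumes a simple sum-of-pairs identity count, which is only one possible objective and not necessarily the one used by any particular alignment program; real methods would use a substitution matrix and gap penalties instead. The function names and the example sequences are hypothetical.

```python
from itertools import combinations


def column_identity_pairs(column, gap="-"):
    """Number of identical residue pairs in one column, ignoring gaps."""
    residues = [c for c in column if c != gap]
    return sum(1 for a, b in combinations(residues, 2) if a == b)


def alignment_score(rows, gap="-"):
    """Sum-of-pairs identity score over all columns of an alignment.

    Assumption: rows are already aligned, i.e. padded with gaps to equal length.
    """
    if len({len(r) for r in rows}) != 1:
        raise ValueError("all aligned rows must have the same length")
    return sum(column_identity_pairs(col, gap) for col in zip(*rows))


if __name__ == "__main__":
    # Two candidate alignments of the same three sequences: ACGT, ACGGT, AGGT.
    candidate_a = ["ACG-T", "ACGGT", "A-GGT"]
    candidate_b = ["ACGT-", "ACGGT", "AGGT-"]
    print(alignment_score(candidate_a))  # 11
    print(alignment_score(candidate_b))  # 8
```

Under this objective the first candidate is preferred, because its gap placement lines up more identical residues per column.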
What to ask yourself
> How do we get a multiple alignment (three or more sequences)?
> What is our aim? Do we go for maximum accuracy, least computational time, or the best compromise?
> What do we want to achieve each time?

Sequence-sequence alignment
[Figure: sequence-sequence alignment]

Multiple alignment methods
> Multi-dimensional dynamic programming: an extension of pairwise sequence alignment.
> Progressive alignment: incorporates phylogenetic information to guide the alignment process.
> Iterative alignment: corrects for problems with progressive alignment by repeatedly realigning subgroups of sequences.

Simultaneous multiple alignment
Multi-dimensional dynamic programming: the combinatorial explosion
> 2 sequences of length n: n^2 comparisons.
> The number of comparisons increases exponentially with the number of sequences, i.e. n^N, where n is the length of the sequences and N is the number of sequences.
> Impractical for even a small number of short sequences.

Multi-dimensional dynamic programming (Murata et al., 1985)
[Figure: three-dimensional dynamic programming lattice with axes Sequence 1, Sequence 2 and Sequence 3]

The MSA approach
MSA (Lipman et al., 1989, PNAS 86, 4412)
> MSA restricts the amount of memory by computing bounds that approximate the centre of a multi-dimensional hypercube.
> Calculate all pair-wise alignment scores.
> Use the scores to predict a tree.
> Calculate pair weights based on the tree (lower bound).
> Produce a heuristic alignment based on the tree.
> Calculate the maximum weight for each sequence pair (upper bound).
> Determine the spatial positions that must be calculated to obtain the optimal alignment.
> Perform