Sequence Motifs, Correlations and Structural Mapping of Evolutionary Data

Sequence motifs, correlations and structural mapping of evolutionary data
Eran Eyal, March 2011

Talk overview
• Sequence profiles – position specific scoring matrix
• Psi-blast: an automated way to create and use sequence profiles in similarity searches
• Sequence patterns and sequence logos
• Bioinformatic tools which employ sequence profiles: PFAM, BLOCKS, PROSITE, PRINTS, InterPro
• Correlated mutations and structural insight
• Mapping sequence data on structures: conservations, correlations

PSSM – position specific scoring matrix
• A position-specific scoring matrix (PSSM) is a commonly used representation of motifs (patterns) in biological sequences.
• A PSSM lets us represent a multiple sequence alignment as a mathematical entity which we can work with.
• PSSMs enable the scoring of sequences, or of other PSSMs, against a multiple alignment.

PSSM – position specific scoring matrix
Assume a string S of length n:

S = s_1 s_2 s_3 ... s_n

If we want to score this string against our PSSM of length n (n positions):

$$\text{alignment\_score} = \sum_{j=1}^{n} m_{s_j,\,j}$$

where m is the PSSM matrix and s_j are the string elements.
PSSMs can also be incorporated into both dynamic programming algorithms and heuristic algorithms (like Psi-Blast).

[Figure: Sequence space]

PSI-BLAST
• For a query sequence, use Blast to find matching sequences.
• Construct a multiple sequence alignment from the hits to find the common regions (consensus).
• Use the “consensus” to search the database again, and get a new set of matching sequences.
• Repeat the process!

[Figure: Sequence space]

Position-Specific-Iterated-BLAST
• Intuition
  – Substitution matrices should be specific to sites and not global.
  – Example: penalize alanine→glycine more in a helix.
• Idea
  – Use BLAST with high stringency to get a set of closely related sequences.
  – Align those sequences to create a new substitution matrix for each position.
  – Then use that matrix to find additional sequences.
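As a rough illustration of the PSSM score defined above, and of the step "align those sequences to create a new substitution matrix for each position", the sketch below builds a log-odds PSSM from a small ungapped alignment and scores a string against it. The uniform background frequency, the pseudocount of 1, the toy alignment and the function names `build_pssm` and `alignment_score` are all assumptions made for this example; they are not taken from the slides or from any particular tool.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(alignment, pseudocount=1.0):
    """Build a log2-odds PSSM from equal-length, ungapped sequences.

    Returns a list with one dict per position, mapping residue -> score.
    Assumption: uniform background frequency (1/20) and a flat pseudocount.
    """
    n = len(alignment[0])
    if any(len(seq) != n for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(AMINO_ACIDS)
    total = len(alignment) + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for j in range(n):
        counts = Counter(seq[j] for seq in alignment)
        column = {}
        for aa in AMINO_ACIDS:
            freq = (counts[aa] + pseudocount) / total
            column[aa] = math.log2(freq / background)
        pssm.append(column)
    return pssm

def alignment_score(pssm, s):
    """Sum m[s_j, j] over all positions j (stored here as pssm[j][s_j])."""
    if len(s) != len(pssm):
        raise ValueError("string and PSSM must have the same length")
    return sum(pssm[j][s[j]] for j in range(len(s)))

# Hypothetical toy alignment of a six-residue motif.
toy_alignment = ["PVAILL", "PVAVLL", "PIAILL", "PVSILL"]
toy_pssm = build_pssm(toy_alignment)
print(alignment_score(toy_pssm, "PVAILL"))  # a motif-like string scores high
print(alignment_score(toy_pssm, "GGGGGG"))  # an unrelated string scores low
```

Real profile tools use observed background frequencies, sequence weighting and more careful pseudocounts, so the numbers here are only indicative.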
Position-Specific-Iterated-BLAST
• Cycling/iterative method
  – Gives increased sensitivity for detecting distantly related proteins
  – Can give insight into functional relationships
  – Very refined statistical methods
• Fast and simple

PSI-BLAST Principle
• First, a standard blastp is performed.
• The highest scoring hits are used to generate a multiple alignment.
• A PSSM is generated from the multiple alignment.
• Another similarity search is performed, this time using the new PSSM.
• Repeat the previous steps until convergence (no new sequences appear after an iteration).

[Figure: Sequence space]

Example: Aminoacyl tRNA Synthetases
• Each is very different
  – Aminoacyl tRNA synthetases differ widely: size, multimers, etc.
  – But all bind to their own tRNAs and amino acids with high specificity.
• TrpRS and TyrRS share only 13% sequence identity
  – Yet the structures of TrpRS and TyrRS are similar
  – Structure ↔ function relationship (see ellipsoid slide from previous lecture…)
• Given structural similarities, we would expect to find sequence similarity…
• However, blastp of E. coli TyrRS against bacterial sequences in SwissProt does NOT show similarity with TrpRS at an e-value cutoff of 10.

[Figure: TrpRS and TyrRS structures. Same SCOP family based on catalytic domain; overall structure similarity noted.]
[Figure: search results. First round: no TrpRS! After a few iterations: TrpRS appears, with similarity to TyrRS!]

PSI-BLAST
• PSI-BLAST is available from BLAST web sites.
• The query form is just like the one for blastp
  – BUT: one extra formatting option must be used
  – A special e-value cutoff is used to determine which alignments will be used for the PSSM build.
  – PSI-BLAST is also available from the stand-alone versions of BLAST.

Using PSI-BLAST
  – Be sure to inspect and think about the results included in the PSSM build.
  – Include/exclude sequences on the basis of biological knowledge: you are in the driving seat!
  – PSI-BLAST performance varies according to the choice of matrix, filter, statistics and nature of the data, just like any other alignment tool.
Why (not) PSI-BLAST
• If the sequences used to construct the position specific scoring matrices (PSSMs) are true homologs, the sensitivity at a given specificity improves significantly.
• However, if non-homologous sequences are included in the PSSMs, they are “corrupted.” They then pull in false, non-homologous sequences and amplify the errors in the next rounds.
• If all hits in the first rounds are highly similar, the predictive power of the new PSSM will not be significantly better than that of the original substitution matrix.

[Figure: query and results. Does the query really have a relationship with the results?]

PSI-BLAST caveat
• Increased ability to find distant homologues
• Cost: additional care is required to prevent non-homologous sequences from being included in the PSSM calculation
  – When in doubt, leave it out!
  – Examine sequences with moderate similarity carefully.
• Be particularly cautious about matches to sequences with highly biased amino acid content
  – Low complexity regions, transmembrane regions and coiled-coil regions often display significant similarity without homology
  – Screen them out of your query sequences!

PSI-BLAST on the command line
• As with simple BLAST searches, using PSI-BLAST on the command line gives the user more power.
• Opens up additional options, e.g.
  – PSI-BLASTing over nucleotide databases
  – automating the number of iterations
  – trying out lots of different settings in parallel
  – inputting multiple sequences

PFAM – Database of protein families represented by HMMs
The Pfam database is a large collection of protein domain families. Each family is represented by multiple sequence alignments and hidden Markov models (HMMs).
There are two levels of quality to Pfam families: Pfam-A and Pfam-B. For each Pfam-A family, Pfam builds a single curated profile hidden Markov model (HMM) from a seed alignment (a small set of representative members of the family). Pfam-B families have no associated annotation or literature reference and are of much lower quality than Pfam-A families.
Release 24.0 has 11912 families.
For each Pfam accession there is a family page, which can be accessed in several ways.
• The HMMs are generated using the HMMER3 program, which is a new and efficient HMM builder.
• There is a new option to search single DNA sequences against the library of Pfam HMMs.
• HMM models can be downloaded, as well as the seed and full multiple alignments used to create the models.

Prosite
http://www.expasy.org/prosite
• PROSITE is a method for determining the function of uncharacterized proteins translated from genomic or cDNA sequences. It consists of a database of biologically significant sites and patterns. With it, one can rapidly and reliably identify to which known protein family (if any) a new sequence belongs.
• In some cases the sequence of an unknown protein is too distantly related to any protein of known structure for the resemblance to be detected by overall sequence alignment, but the protein can still be identified by the occurrence in its sequence of a particular cluster of residues, variously known as a pattern, motif, signature, or fingerprint.
What is a sequence pattern
A pattern in our context is a protein WORD conserved in many sequences:

PVAILL

A pattern lets you identify a protein family.

http://expasy.org/prosite/
Prosite patterns can describe complex signatures

[RK]-x-[ST]

This reads as follows: “an Arginine or a Lysine, followed by any one residue, followed by a Serine or a Threonine”

C-[DES]-x-C-x(3)-I-x(3)-R-x(4)-P-x(4)-C-x(2)-C

is a signature for Zn finger proteins which bind DNA.

Using PrositeScan
http://expasy.org/tools/scanprosite/
Example query sequence:

MALRAGLVLG FHTLMTLLSP QEAGATKADH MGSYGPAFYQ SYGASGQFTH EFDEEQLFSV DLKKSEAVWR LPEFGDFARF DPQGGLAGIA AIKAHLDILV ERSNRSRAIN VPPRVTVLPK SRVELGQPNI LICIVDNIFP PVINITWLRN GQTVTEGVAQ TSFYSQPDHL FRKFHYLPFV

[Screenshots: PrositeScan results; Using PROSITE-Scan: Structure]

Prints
• PRINTS is a compendium of protein fingerprints, which are conserved motifs used to characterize a protein family.
• Release 41.1 of PRINTS contains 2050 entries, encoding 12,121 individual motifs.
• Two types of fingerprint are represented in the database: simple or composite. Simple fingerprints are essentially single motifs, while composite fingerprints encode multiple motifs.
• Most entries are of the latter type because discrimination power is greater for multi-component searches, and results are easier to interpret.

Direct PRINTS access: http://www.bioinf.manchester.ac.uk/dbbrowser/PRINTS/index.php
• By accession number
• By PRINTS code
• By database code
• By text
• By sequence
• By title
• By number of motifs
• By author
• By query language
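To make the PROSITE pattern syntax shown in the pattern examples concrete, here is a minimal sketch that translates a pattern such as [RK]-x-[ST] into a Python regular expression and scans a sequence with it. The function names `prosite_to_regex` and `scan` are made up for this example, and the conversion only covers the common elements (x, [..], {..}, repeat counts such as x(3) or x(2,4), and the < and > anchors outside brackets); it is not the ScanProsite implementation and ignores rarer syntax.

```python
import re

def prosite_to_regex(pattern):
    """Translate a simple PROSITE pattern into a Python regex string.

    Assumption: the pattern uses only x, [..], {..}, (n) / (n,m) and < / >.
    """
    regex = pattern.rstrip(".")            # PROSITE entries end with a period
    regex = regex.replace("-", "")         # '-' only separates elements
    regex = regex.replace("x", ".")        # x = any residue
    regex = regex.replace("{", "[^").replace("}", "]")  # {..} = forbidden residues
    regex = regex.replace("(", "{").replace(")", "}")   # (n) = repeat count
    regex = regex.replace("<", "^").replace(">", "$")   # N- and C-terminal anchors
    return regex

def scan(pattern, sequence):
    """Return (1-based start, matched text) for non-overlapping matches."""
    compiled = re.compile(prosite_to_regex(pattern))
    return [(m.start() + 1, m.group()) for m in compiled.finditer(sequence)]

print(prosite_to_regex("[RK]-x-[ST]"))
print(prosite_to_regex("C-[DES]-x-C-x(3)-I-x(3)-R-x(4)-P-x(4)-C-x(2)-C"))
print(scan("[RK]-x-[ST]", "AAKGSAA"))  # hypothetical toy sequence
```

Because `finditer` reports non-overlapping matches only, overlapping occurrences of a short pattern would be under-counted; a lookahead-based regex would be needed to list them all.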
Sequence logos
• A sequence logo is a graphical representation of aligned sequences in which, at each position, the size of each residue is proportional to its frequency at that position, and the total height of all the residues at the position is proportional to the conservation (information content) of the position.

Blocks
• Blocks are multiply aligned ungapped segments corresponding to the most highly conserved regions of proteins.
• The blocks for the Blocks Database are made automatically by looking for the most highly conserved regions in groups of proteins.
• Blocks is not updated any more. The last version of database (14.3)

InterPro
http://www.ebi.ac.uk/interpro/
• InterPro is an integrated database of predictive protein "signatures" used for the classification and automatic annotation of proteins and genomes.
• It facilitates prediction of the occurrence of functional domains, repeats and important sites.
• InterPro combines a number of databases (referred to as member databases) that use different methodologies to derive protein signatures. By uniting the member databases, InterPro capitalises
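Returning to the sequence-logo definition, the letter heights of a logo can be computed from column frequencies. The sketch below assumes a 20-letter protein alphabet, measures conservation as log2(20) minus the Shannon entropy of the column, and omits the small-sample correction that logo tools usually apply. The function name `logo_heights` and the toy alignment are hypothetical.

```python
import math
from collections import Counter

def logo_heights(alignment, alphabet_size=20):
    """Per-position letter heights (in bits) for an ungapped alignment.

    Column height = log2(alphabet_size) - Shannon entropy of the column;
    each residue gets a share of that height proportional to its frequency.
    """
    n = len(alignment[0])
    if any(len(seq) != n for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    heights = []
    for j in range(n):
        counts = Counter(seq[j] for seq in alignment)
        total = sum(counts.values())
        freqs = {aa: c / total for aa, c in counts.items()}
        entropy = -sum(f * math.log2(f) for f in freqs.values())
        information = math.log2(alphabet_size) - entropy
        heights.append({aa: f * information for aa, f in freqs.items()})
    return heights

# Hypothetical toy alignment of a six-residue motif.
toy_alignment = ["PVAILL", "PVAVLL", "PIAILL", "PVSILL"]
for position, column in enumerate(logo_heights(toy_alignment), start=1):
    print(position, {aa: round(h, 2) for aa, h in column.items()})
```

A fully conserved column (position 1, all P) reaches the maximum of log2(20) ≈ 4.32 bits, while a mixed column is shorter and is split among its residues by frequency.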
