
Sequence motifs, correlations and structural mapping of evolutionary data

Eran Eyal, March 2011

Talk overview

• Sequence profiles – position specific scoring
• Psi-BLAST: an automated way to create and use sequence profiles in similarity searches
• Sequence patterns and sequence logos
• Bioinformatic tools which employ sequence profiles: BLOCKS, PROSITE, PRINTS, InterPro
• Correlated mutations and structural insight
• Mapping sequence data on structures: conservations, correlations

PSSM – position specific scoring matrix

• A position-specific scoring matrix (PSSM) is a commonly used representation of motifs (patterns) in biological sequences.
• A PSSM lets us represent multiple sequence alignments as mathematical entities that we can work with.
• PSSMs enable the scoring of multiple alignments against sequences or against other PSSMs.

Assuming a string S of length n

S = s1s2s3...sn

If we want to score this string against our PSSM of length n (with n lines):

$$\text{alignment\_score} = \sum_{j=1}^{n} m_{s_j,\,j}$$

where m is the PSSM matrix and s_j are the string elements. A PSSM can also be incorporated into both dynamic programming algorithms and heuristic algorithms (like Psi-Blast).
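The scoring sum above can be sketched in a few lines of Python. This is only an illustration: the three-letter alphabet, the numbers in `TOY_PSSM`, and the function names are invented for the example and do not come from any real profile.

```python
from typing import Dict, List, Tuple

# Toy PSSM over a reduced alphabet; the numbers are invented for illustration.
# TOY_PSSM[j][residue] is the score of `residue` at position j (0-based here).
TOY_PSSM: List[Dict[str, float]] = [
    {"A": 2.0, "C": -1.0, "G": 0.5},
    {"A": -0.5, "C": 3.0, "G": -2.0},
    {"A": 1.0, "C": 0.0, "G": 1.5},
]


def score_against_pssm(seq: str, pssm: List[Dict[str, float]]) -> float:
    """Sum m[s_j][j] over all positions j of the string."""
    if len(seq) != len(pssm):
        raise ValueError("sequence and PSSM must have the same length")
    return sum(column[residue] for residue, column in zip(seq, pssm))


def best_window(seq: str, pssm: List[Dict[str, float]]) -> Tuple[float, int]:
    """Slide the PSSM along a longer sequence; return (best score, 0-based start)."""
    n = len(pssm)
    scores = [(score_against_pssm(seq[i:i + n], pssm), i)
              for i in range(len(seq) - n + 1)]
    return max(scores)


print(score_against_pssm("ACG", TOY_PSSM))  # 6.5
print(best_window("GGACGA", TOY_PSSM))      # (6.5, 2)
```

The sliding-window helper is the simplest ungapped use of a PSSM; dynamic programming or a heuristic search would replace it when gaps are allowed.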

PSI-BLAST

• For a query sequence, use BLAST to find matching sequences.
• Construct a multiple alignment from the hits to find the common regions (consensus).
• Use the “consensus” to search the database again and get a new set of matching sequences.
• Repeat the process!

Position-Specific-Iterated-BLAST

• Intuition
  – Substitution matrices should be specific to sites and not global.
  – Example: penalize alanine→glycine more in a helix.
• Idea
  – Use BLAST with high stringency to get a set of closely related sequences.
  – Align those sequences to create a new substitution matrix for each position.
  – Then use that matrix to find additional sequences.
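The step "align those sequences to create a new substitution matrix for each position" can be sketched as follows. This is a minimal illustration and not PSI-BLAST's actual procedure: it assumes an ungapped alignment, a uniform background distribution, and a simple additive pseudocount. The function name `build_pssm` and the toy alignment are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def build_pssm(alignment: List[str], alphabet: str,
               pseudocount: float = 1.0) -> List[Dict[str, float]]:
    """Return one {residue: log-odds score} dict per alignment column."""
    background = 1.0 / len(alphabet)           # assumption: uniform background
    n_seqs = len(alignment)
    denom = n_seqs + pseudocount * len(alphabet)
    pssm = []
    for column in zip(*alignment):             # iterate over alignment columns
        counts = Counter(column)
        pssm.append({
            aa: math.log((counts[aa] + pseudocount) / denom / background)
            for aa in alphabet
        })
    return pssm


# Toy alignment over a three-letter alphabet (invented for illustration).
toy_pssm = build_pssm(["AG", "AG", "AC", "AG"], alphabet="ACG")
for position, scores in enumerate(toy_pssm, start=1):
    print(position, {aa: round(s, 2) for aa, s in scores.items()})
```

Each column gets its own scores, so the same substitution can be penalized differently at different sites, which is the intuition stated above.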

Position-Specific-Iterated-BLAST

• Cycling/iterative method
  – Gives increased sensitivity for detecting distantly related proteins
  – Can give insight into functional relationships
  – Very refined statistical methods
• Fast and simple

PSI-BLAST Principle

• First, a standard blastp is performed.
• The highest scoring hits are used to generate a multiple alignment.
• A PSSM is generated from the multiple alignment.
• Another similarity search is performed, this time using the new PSSM.
• Repeat the previous steps until convergence (no new sequences appear after an iteration).

Example: Aminoacyl tRNA Synthetases

• Aminoacyl tRNA synthetases are very different from one another: size, multimers, etc.
  – But all bind to their own tRNAs and amino acids with high specificity.
• TrpRS and TyrRS share only 13% sequence identity.
  – Yet the structures of TrpRS and TyrRS are similar (overall structure similarity noted; same SCOP family, based on the catalytic domain).
  – Structure ⇔ function relationship (see the ellipsoid slide from the previous lecture).
• Given the structural similarities, we would expect to find sequence similarity.
• However, blastp of E. coli TyrRS against bacterial sequences in SwissProt does NOT show similarity with TrpRS at an e-value cutoff of 10: no TrpRS!
• After a few PSI-BLAST iterations, the similarity of TrpRS to TyrRS shows up.

Using PSI-BLAST

• PSI-BLAST is available from the BLAST web sites.
• The query form is just like the one for blastp.
  – BUT: one extra formatting option must be used.
  – A special e-value cutoff is used to determine which alignments will be used for the PSSM build.
  – PSI-BLAST is also available from the stand-alone versions of BLAST.
• Be sure to inspect and think about the results included in the PSSM build.
  – Include/exclude sequences on the basis of biological knowledge: you are in the driving seat!
  – PSI-BLAST performance varies according to the choice of matrix, filter, and nature of the data, just like any other alignment tool.

Why (not) PSI-BLAST

• If the sequences used to construct the Position Specific Scoring Matrices (PSSMs) are true homologs, the sensitivity at a given specificity improves significantly.
• However, if non-homologous sequences are included in the PSSMs, the PSSMs are “corrupted”: they pull in further non-homologous sequences and amplify the errors in the next rounds.
• If all hits in the first rounds are highly similar, the predictive power of the new PSSM will not be significantly better than that of the original substitution matrix.
• Does the query really have a relationship with the results?

PSI-BLAST caveat

• Increased ability to find distant homologues
• Cost: additional care is required to prevent non-homologous sequences from being included in the PSSM calculation
  – When in doubt, leave it out!
  – Examine sequences with moderate similarity carefully.
• Be particularly cautious about matches to sequences with highly biased content
  – Low complexity regions, transmembrane regions and coiled-coil regions often display significant similarity without homology
  – Screen them out of your query sequences!

PSI-BLAST on the command line

• As with simple BLAST searches, using PSI-BLAST on the command line gives the user more power
• Opens up additional options, e.g.
  – PSI-BLASTing over nucleotide databases
  – automating the number of iterations
  – trying out lots of different settings in parallel
  – inputting multiple sequences

PFAM – Database of Protein families represented by HMM

The Pfam database is a large collection of protein families. Each family is represented by multiple sequence alignments and hidden Markov models (HMMs).

There are two levels of quality to Pfam families: Pfam-A and Pfam-B. For each Pfam-A family, Pfam builds a single curated profile (HMM) from a seed alignment (a small set of representative members of the family). Pfam-B families have no associated annotation or literature reference and are of much lower quality than Pfam-A families.

• The HMMs are generated using the HMMER3 program, which is a new and efficient HMM builder.
• There is a new option to search single DNA sequences against the library of Pfam HMMs.
• HMM models can be downloaded, as well as the seed and full multiple alignments used to create the models.

Release 24.0 has 11912 families.

For each Pfam accession there is a family page, which can be accessed in several ways.

Prosite http://www.expasy.org/prosite

• PROSITE is a method for determining the function of uncharacterized proteins translated from genomic or cDNA sequences. It consists of a database of biologically significant sites and patterns. With it, one can rapidly and reliably identify the known protein family (if any) to which a new sequence belongs.

• In some cases the sequence of an unknown protein is too distantly related to any protein of known structure to detect its resemblance by overall sequence alignment, but it can be identified by the occurrence in its sequence of a particular cluster of residues which is variously known as a pattern, motif, signature, or fingerprint.

What is a sequence pattern

A Pattern in our context is a Protein WORD conserved in many sequences:

PVAILL

A pattern lets you identify a

http://expasy.org/prosite/

Prosite patterns can describe complex signatures

[RK]-x-[ST]

This reads as follows:

“an Arginine or a Lysine, followed by any one residue, followed by a Serine or a Threonine”

C-[DES]-x-C-x(3)-I-x(3)-R-x(4)-P-x(4)-C-x(2)-C

is a signature for Zn finger proteins which bind DNA.
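Because these patterns are essentially regular expressions, they can be translated mechanically. The sketch below is an illustration only: it handles the elements used above (`x`, `[..]`, `{..}`, and repeat counts such as `x(3)`), ignores the `<` and `>` anchors, and is not the ScanProsite implementation. The function names and the test sequence are hypothetical.

```python
import re
from typing import List, Tuple


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE pattern such as [RK]-x-[ST] into a Python regex."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        # Split off an optional repeat count such as x(3) or x(2,4).
        m = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, repeat = m.group(1), m.group(2)
        if core == "x":
            core = "."                          # x = any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"      # {AB} = any residue except A or B
        parts.append(core + ("{" + repeat + "}" if repeat else ""))
    return "".join(parts)


def scan_prosite(sequence: str, pattern: str) -> List[Tuple[int, str]]:
    """Return (1-based start, matched text) for every hit, overlaps included."""
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start() + 1, m.group(1)) for m in regex.finditer(sequence)]


print(prosite_to_regex("[RK]-x-[ST]"))     # [RK].[ST]
print(scan_prosite("AKRTSA", "[RK]-x-[ST]"))  # [(2, 'KRT'), (3, 'RTS')]
```

The lookahead in `scan_prosite` is a design choice so that overlapping occurrences of a short pattern are all reported.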

Using PrositeScan

http://expasy.org/tools/scanprosite/

MALRAGLVLG FHTLMTLLSP QEAGATKADH MGSYGPAFYQ SYGASGQFTH EFDEEQLFSV DLKKSEAVWR LPEFGDFARF DPQGGLAGIA AIKAHLDILV ERSNRSRAIN VPPRVTVLPK SRVELGQPNI LICIVDNIFP PVINITWLRN GQTVTEGVAQ TSFYSQPDHL FRKFHYLPFV

Using PROSITE-Scan: Structure

Prints

• PRINTS is a compendium of protein fingerprints, which are conserved motifs used to characterize a protein family.
• Release 41.1 of PRINTS contains 2050 entries, encoding 12,121 individual motifs.
• Two types of fingerprint are represented in the database: simple or composite. Simple fingerprints are essentially single motifs, while composite fingerprints encode multiple motifs.
• Most entries are of the latter type because discrimination power is greater for multi-component searches, and results are easier to interpret.

Direct PRINTS access: http://www.bioinf.manchester.ac.uk/dbbrowser/PRINTS/index.php

• By accession number
• By PRINTS code
• By database code
• By text
• By sequence
• By title
• By number of motifs
• By author
• By query language

Sequence logos

• A sequence logo is a graphical representation of aligned sequences where, at each position, the size of each residue is proportional to its frequency in that position, and the total height of all the residues in the position is proportional to the conservation (information content) of the position.
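The letter heights of a logo can be computed directly from column frequencies. The sketch below assumes the common convention of measuring information content in bits as log2(20) minus the column entropy, and it omits the small-sample correction that logo tools usually apply. The function name and the toy alignment are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def logo_heights(alignment: List[str],
                 alphabet_size: int = 20) -> List[Dict[str, float]]:
    """For each column, return {residue: letter height in bits}."""
    max_bits = math.log2(alphabet_size)
    heights = []
    for column in zip(*alignment):
        counts = Counter(column)
        total = sum(counts.values())
        freqs = {aa: c / total for aa, c in counts.items()}
        entropy = -sum(f * math.log2(f) for f in freqs.values())
        information = max_bits - entropy        # total stack height of the column
        heights.append({aa: f * information for aa, f in freqs.items()})
    return heights


# Toy alignment, invented for illustration.
toy = ["REVN", "EKVN", "KEVN", "RDVS", "DKVS", "DKVS", "ERVS"]
for position, stack in enumerate(logo_heights(toy), start=1):
    print(position, {aa: round(h, 2) for aa, h in stack.items()})
```

In this toy alignment the fully conserved third column gets the tallest stack, while the variable first column gets a short one.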

Blocks

• Blocks are multiply aligned ungapped segments corresponding to the most highly conserved regions of proteins.
• The blocks for the Blocks Database are made automatically by looking for the most highly conserved regions in groups of proteins.
• Blocks is no longer updated. The last version of the database (14.3) is from 2007.

InterPro http://www.ebi.ac.uk/interpro/

• InterPro is an integrated database of predictive protein "signatures" used for the classification and automatic annotation of proteins and genomes.
• It facilitates prediction of the occurrence of functional domains, repeats and important sites.
• InterPro combines a number of databases (referred to as member databases) that use different methodologies to derive protein signatures. By uniting the member databases, InterPro capitalises on their individual strengths, producing a powerful integrated database and diagnostic tool (InterProScan).

The member databases use a number of approaches:

- ProDom: provider of sequence clusters built from UniProtKB using PSI-BLAST.
- PROSITE patterns: provider of simple regular expressions.
- PROSITE and HAMAP profiles: providers of sequence matrices.
- PRINTS: provider of fingerprints, which are groups of aligned, un-weighted Position Specific Sequence Matrices (PSSMs).
- PANTHER, PIRSF, Pfam, SMART, TIGRFAMs, Gene3D and SUPERFAMILY: providers of hidden Markov models (HMMs).

Entries typed Family contain signatures that cover all domains in the matching proteins and span >80% of the protein length with no adjacent signatures of type Domain or Region in >90% of the entry protein set. Entries typed Domain identify biological units with defined boundaries, which includes structural and functional domains as well as defined sub-domains.

Information extracted from multiple sequence alignment (MSA)

R E V N
E K V N
K E V N
R D V S
D K V S
D K V S
E R V S

[Figure labels: conserved, tree determinant, coupled (“correlated mutations”)]

Conservation of position i, as the fraction of sequences carrying the consensus residue:

$$C_i = \frac{N_i^{\text{consensus}}}{N^{\text{sequences}}}$$

or as the entropy of the position:

$$C_i = -\sum_{aa=1}^{20} p_i^{aa} \ln p_i^{aa}$$

Correlated mutations

• Approaches to detect residue coupling
• Applications
• Some debates regarding correlated mutations

So what can correlated mutations tell us, and where are they useful?

• Basically every evolutionary constrained
• Very useful for RNA folding
• In proteins:
  – Contact prediction
  – Analysis of important interactions
  – Analysis of allosteric paths and energetically coupled residues

Approaches to detect residue coupling

• Correlation coefficient: Gobel et al. (1994), Proteins
• OMES: Kass and Horovitz (2002), Proteins
• Mutual Information
• SCA: Lockless and Ranganathan (1999), Science
• P2P

i, j – positions

ws, wt – sequence weights s, t - sequences

si – amino acid found at position i on sequence s

http://bip.weizmann.ac.il/correlated_mutations/

Instead of calculating correlations, we can derive universal scores for substitutions between amino acid pairs (a matrix of 400 × 400 amino acid pairs).

Score (R,E → E,K)?

How to derive such a substitution matrix?

Blocks: small un-gapped multiple alignments

R E V N P R G L A M E A V W N R   w1
E K V N P E G L A V K A V W N G   w2
K E V N P K G L A V E A V W N D   w3
R D V S P R R I S V D S V W S K   w4
D K V S A D E L A V K A V Y S N   w5
D K V S G D K V G V K S V Y S A   w6
E R V S G E K I G V R S V F S A   w7

Advantages:
• Accurate alignments
• No gaps
• Sequence weights
• Representative structures for each block

Eyal et al. (2007) Proteins, 67, 142-153

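The pair-to-pair (P2P) substitution matrix

$$M[xy][uv] = \underbrace{\ln\frac{f_{obs}^{con}[xy][uv]}{f_{exp}^{con}[xy][uv]}}_{\text{signal}} - \underbrace{\ln\frac{f_{obs}^{nocon}[xy][uv]}{f_{exp}^{nocon}[xy][uv]}}_{\text{noise}}$$

$$f_{obs}^{con}[xy][uv] = \frac{n_{obs}^{con}[xy][uv]}{\sum_{abcd} n_{obs}^{con}[ab][cd]}$$

$$f_{exp}^{con}[xy][uv] = \frac{n_{obs}[x][u]}{\sum_{ab} n_{obs}[a][b]} \cdot \frac{n_{obs}[y][v]}{\sum_{ab} n_{obs}[a][b]}$$

Types of pair substitution: XY → XY (invariant pairs), XY → YX (flipped pairs), XY → XZ, XY → WZ.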

What is the matrix useful for?

• Detect contacts between amino acids when there is no structural data
• Evaluate structures
• Detect functionally/structurally important regions

Eyal et al. (2007) Proteins, 67, 142-153

P2P for contact prediction

• Contact prediction in smaller proteins is easier.
• Galectin-7 (1bkz): all predictions are between 2 β-sheets.
• β-lactamase II (1bc2): most predictions are around the metal binding site.

P2P as a scoring function for structure evaluation

Each pair of positions i, j is scored with S_ij using P2P. The overall score is

$$S = \sum_{i,j \text{ with contact}} S_{ij}$$

Advantages of the P2P method over other methods

• No need for large MSAs

• No need to construct evolutionary trees

• Naturally handle conservation and correlations

• Interactive web implementation

Considering correlations together improves contact prediction – the GARP approach

Considering also neighbors and “windows” of correlations may improve the predictions of primary correlated mutations methods.

[Figure: graph with nodes u1–u4, w, v and a1–a4]

Frankel et al. (2007) BMC , 67, 142-153

Methods based on the evolutionary tree

[Figure: evolutionary tree with 2 mutation events; sequences shown: ADSDDFGRLIILM, ADSDDFGRLIILL, ADSDLFGVLIILM, GDTDDFGRLIILM, ADTDLFGVLIILM, ADSDLFGVLIILL]

Methods based on the evolutionary tree

Although the same multiple alignment will be obtained in the 2 cases, it is clear that an evolutionary history of multiple independent events is a much stronger indication of real coupling.

[Figure: two alternative trees over the sequences GDTDDFGRLIILM, ADSDDFGRLIILM, ADSDDFGRLIILL, ADTDLFGVLIILM, ADTDDFGRLIILM and ADSDLFGVLIILL, labelled “2 mutation events”]

Methods based on evolutionary tree:

• Pagel M. (1994) Proc R Soc Lond
• Pollock D, Taylor W. (1997) Protein Eng
• Pollock D et al. (1999) J Mol Biol
• Tuffery and Darlu (2000) Mol Biol Evol
• Fleishman S et al. (2004) J Mol Biol
• Noivirt et al. (2005) Protein Eng

Statistical coupling

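MSA (with positions i and j marked):

R E V N
E K V N
K E V N
R D V S
D K V S
D K V S
E K V S

MSA|δj (sub-alignment after the perturbation at position j):

E K V N
D K V S
D K V S
E K V S

$$\Delta\Delta G_{i,j}^{stat} = kT^{*}\sqrt{\sum_{aa=1}^{20}\left(\ln\frac{p_{i|\delta j}^{aa}}{p_{MSA|\delta j}^{aa}} - \ln\frac{p_{i}^{aa}}{p_{MSA}^{aa}}\right)^{2}}$$

For every selected j we can measure the coupling to all other sites i.

Lockless and Ranganathan (1999), Science, 286, 295-299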
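The coupling formula can be evaluated literally on a toy alignment. The sketch below is a simplification and not the published SCA implementation: it substitutes smoothed amino acid frequencies for the binomial probabilities of the original paper, and it takes the square root of the summed squares as in the published definition. The smoothing weight, the function names, and the default kT* = 1 are all assumptions made for illustration.

```python
import math
from typing import Dict, List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
SMOOTHING = 0.05  # assumption: weight of a uniform prior, avoids log(0)


def smoothed_freqs(residues: str) -> Dict[str, float]:
    """Amino acid frequencies in `residues`, mixed with a uniform prior."""
    n = len(residues)
    return {aa: (1 - SMOOTHING) * residues.count(aa) / n + SMOOTHING / 20
            for aa in AMINO_ACIDS}


def statistical_coupling(msa: List[str], i: int, j: int,
                         residue_at_j: str, kT: float = 1.0) -> float:
    """Coupling of site i to the perturbation 'residue_at_j at site j' (0-based)."""
    sub = [s for s in msa if s[j] == residue_at_j]       # MSA | delta_j
    if not sub:
        raise ValueError("perturbation leaves no sequences")
    p_i = smoothed_freqs("".join(s[i] for s in msa))
    p_i_sub = smoothed_freqs("".join(s[i] for s in sub))
    p_msa = smoothed_freqs("".join(msa))
    p_msa_sub = smoothed_freqs("".join(sub))
    total = sum(
        (math.log(p_i_sub[aa] / p_msa_sub[aa])
         - math.log(p_i[aa] / p_msa[aa])) ** 2
        for aa in AMINO_ACIDS
    )
    return kT * math.sqrt(total)


msa = ["REVN", "EKVN", "KEVN", "RDVS", "DKVS", "DKVS", "EKVS"]
# Perturbation: keep only the sequences with K at the second position.
for site in range(4):
    print(site + 1, round(statistical_coupling(msa, site, 1, "K"), 3))
```

With so few sequences the numbers are dominated by sampling noise, which is one reason the number of sequences remaining after the perturbation matters in practice.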

Statistical coupling

• Statistical coupling: Lockless and Ranganathan (1999), Science, 286, 295-299
• PDZ domain: Lockless and Ranganathan (1999), Science, 286, 295-299
• WW domain: Russ et al. (2005), Nature, 437, 579-583
• Cooperativity in WW domains: Russ et al. (2005), Nature, 437, 579-583

Studies using SCA

• Estabrook et al. (2005), PNAS: methyltransferases
• Marcelino et al. (2006), Proteins: intracellular lipid binding proteins (iLBPs)
• Swain et al. (2006), Curr Opin Str Biol: chaperones
• Chen et al. (2006), JBC: Cys loop ligand-gated ion channels
• Dima and Thirumalai (2006), Protein Sci: Selectins
• Ferguson et al. (2007), PNAS: TonB-dependent transporters
• Yu et al. (2007), Biophys J: DNA Helicases
• Lee et al. (2008), Science: PAS-DHFR
• Hsu and Traugh (2010), PLoS One: protein kinase Pak2

Is correlated mutations analysis really meaningful? Which are the leading methods?

• Fodor and Aldrich (2004), Proteins
• Halperin et al. (2006), Proteins

Can correlated mutations reveal allosteric pathways?

[Figure: 6 × 6 matrices over positions 1–6. Fodor and Aldrich (2004), Proteins]

[Figure: panels A and B. Fodor and Aldrich (2004), JBC, 279, 19046-19050]

Can CM detect interactions between different molecules?

[Figure: intra- vs. inter-molecular correlations. Halperin et al. (2006), Proteins]

The main problems in detecting inter-protein coupling

• Correct selection of paralogs
• Basic assumptions? Are interfaces conserved? Do all pairs interact?
• Smaller number of protein complexes and less data about interfaces for testing/training

Covariance analysis of Glutamate transporter

• 989 sequences were extracted from PFAM for the sodium-dicarboxylate symporter family (PF00375).
• The alignment was modified so that the reference numbering is that of the human EAAT1 protein.
• Covariance analysis was performed using different methods.
• Hierarchical clustering was used to analyze the matrices.

[Figure labels: residues from TM2, residues from TM4a, residues from TM4c, residues in the core region; core region, interface regions]

Is there a real connection between CM and energetically coupled residues?

Works for PDZ domain

Bouncing back

Different correlated mutations methods are appropriate for different tasks

McBasc, OMES, P2P, MI, SCA

Did Fodor implement SCA appropriately?

• The number of sequences remaining after the perturbation should be considered.
• The matrix is not symmetrical, but is treated as such by Fodor.

Not so fast…

Dima and Thirumalai (2006), Protein Sci 15, 258-268

Correlations in HIV-1 Protease

• Cleavage of premature polypeptides to form the proteins required by the virus
• A major drug target in AIDS therapies
• Exhibits multi-drug resistance
• Large amount of data available
  – sequence databases
  – many solved structures
  – clinical information and known drug resistant mutations

Dataset and MSAs

IDV = indinavir, protease inhibitor
NFV = nelfinavir, protease inhibitor

Data source: http://hivdb.stanford.edu/

Mutual Information

• Mutual information measures the dependence between two random variables.
• Suppose X_i and X_j are two random variables. The mutual information between X_i and X_j is defined as

$$I(X_i, X_j) = \sum_{x_i}\sum_{x_j} p(x_i, x_j)\,\log\frac{p(x_i, x_j)}{p(x_i)\,p(x_j)}$$

where x_i and x_j are specific amino acid types, p(x_i, x_j) is the joint probability, and p(x_i), p(x_j) are the singlet (marginal) probabilities.

Two Extreme Cases

• When X_i and X_j are independent, I(X_i, X_j) = 0.
• When X_i and X_j follow exactly the same distribution (one fully determines the other), I(X_i, X_j) equals the entropy of X_i.

Clustering

Why do clustering?
  – The origin of the correlations is not always pairwise, but most available statistical methods are based on pairwise metrics. Clustering helps in detecting more integrated patterns.
  – Enhance signal over noise (S/N ratio) (Noivirt et al., Protein Eng. Des. Sel., 2005)

Mutual Information Matrix

• By calculating I(X_i, X_j) for all pairs of i, j, we obtain an N×N mutual information matrix W with elements I(X_i, X_j).

• In our case N = 99.
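Building the mutual information matrix from an alignment takes only a few lines. The sketch below is an illustration under simple assumptions: plain observed frequencies (no sequence weighting, pseudocounts, or gap handling) and natural logarithms. The function names and the four-column toy alignment are hypothetical, standing in for the N = 99 columns of the protease.

```python
import math
from collections import Counter
from typing import List


def mutual_information(col_i: str, col_j: str) -> float:
    """I(X_i, X_j) from the observed residue pairs of two alignment columns."""
    n = len(col_i)
    joint = Counter(zip(col_i, col_j))          # joint counts of (x_i, x_j)
    count_i, count_j = Counter(col_i), Counter(col_j)
    return sum(
        (c / n) * math.log((c / n) / ((count_i[a] / n) * (count_j[b] / n)))
        for (a, b), c in joint.items()
    )


def mi_matrix(msa: List[str]) -> List[List[float]]:
    """N x N matrix W with W[i][j] = I(X_i, X_j) for an alignment of N columns."""
    columns = ["".join(col) for col in zip(*msa)]
    return [[mutual_information(a, b) for b in columns] for a in columns]


toy = ["REVN", "EKVN", "KEVN", "RDVS", "DKVS", "DKVS", "ERVS"]
for row in mi_matrix(toy):
    print([round(value, 3) for value in row])
```

The output shows both extreme cases: the fully conserved third column has zero mutual information with every other column, and each diagonal element equals the entropy of its column.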

Spectral Clustering

• A graph segmentation algorithm
• Minimize the normalized cut between two groups A and B (Shi and Malik, IEEE Trans, 2000):

$$cut(A, B) = \sum_{u \in A,\, v \in B} w(u, v)$$

$$Ncut(A, B) = \frac{cut(A, B)}{assoc(A, V)} + \frac{cut(A, B)}{assoc(B, V)}$$

• assoc(A, V) is the total weight of connection from A to all nodes in the graph

[Figure: Scheme A and Scheme B]

Back to the Protein

• Each column in the MSA corresponds to a residue, which in turn is represented as a node in the graph.
• The mutual information between X_i and X_j is the weight of the edge between node i and node j (mutual information matrix = weight matrix).

Spectral Clustering

• The problem reduces to solving a generalized eigenvalue problem

$$(D - W)\,y = \lambda D\,y$$

  where D is a diagonal matrix with elements $D_{ii} = \sum_j W_{ij}$ and W is the mutual information weight matrix.
• The eigenvector with the first nonzero eigenvalue is used to bi-partition the nodes.
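The bi-partitioning step can be sketched without any numerical library. The code below is a minimal illustration, not the implementation used in the study: it finds the eigenvector of the first nonzero eigenvalue of (D − W) y = λ D y by power iteration with deflation, where a real analysis would normally call a library eigensolver. It assumes W is symmetric, non-negative, has positive row sums, and describes a connected graph. The function name and the toy weight matrix are hypothetical.

```python
import math
from typing import List


def spectral_bipartition(W: List[List[float]], iterations: int = 500) -> List[bool]:
    """Split nodes by the sign of the second generalized eigenvector of
    (D - W) y = lambda * D * y, found by deflated power iteration."""
    n = len(W)
    d = [sum(row) for row in W]                        # D_ii = sum_j W_ij
    s = [math.sqrt(x) for x in d]
    # Normalized affinity M = D^-1/2 W D^-1/2. Its top eigenvector, s / |s|,
    # corresponds to the trivial solution with eigenvalue lambda = 0.
    M = [[W[a][b] / (s[a] * s[b]) for b in range(n)] for a in range(n)]
    norm_s = math.sqrt(sum(x * x for x in s))
    top = [x / norm_s for x in s]
    z = [float(k + 1) for k in range(n)]               # arbitrary start vector
    for _ in range(iterations):
        # Multiply by (M + I) so that all eigenvalues are non-negative.
        z = [z[a] + sum(M[a][b] * z[b] for b in range(n)) for a in range(n)]
        proj = sum(z[a] * top[a] for a in range(n))    # remove the trivial solution
        z = [z[a] - proj * top[a] for a in range(n)]
        norm = math.sqrt(sum(x * x for x in z))
        z = [x / norm for x in z]
    y = [z[a] / s[a] for a in range(n)]                # back to y = D^-1/2 z
    return [value > 0 for value in y]


# Toy weight matrix: two groups of three nodes, strong weights inside each
# group and weak weights between the groups (values invented for illustration).
strong, weak = 1.0, 0.01
toy_W = [[0.0 if a == b else (strong if a // 3 == b // 3 else weak)
          for b in range(6)] for a in range(6)]
print(spectral_bipartition(toy_W))
```

Which group gets `True` is arbitrary because an eigenvector is defined only up to sign; only the partition itself is meaningful.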

Sequence Correlation Matrix and its Permutation Based on Clustering

Results

• Two clusters were distinguished by the spectral clustering procedure.
• One of the clusters (blue) contains residues known to be involved in multi-drug resistance.
• The other cluster (red) contains residues that exhibit substantial sequence variability between subtypes of HIV.

[Figure: correlation matrices for treated data and untreated data. Gonzales et al. J. Infec. Dis. 2001]

Cooperative Coupling

Relation between Sequence Variability and Protein Dynamics (GNM)

[Figure labels: the two clustering partitions; mobilities]

• Covariance analysis detects the drug-resistance mutations and their cooperativity in HIV-1 protease, in agreement with experimental data.
• Clustering techniques can be applied to analyse the data.
• A relationship is elucidated between coevolving residue clusters and the collective dynamics of the protease.

Correlated mutations - summary

• Correlated mutations analysis is a simple tool to detect coupling between residues. The tremendous amount of available sequences makes it more attractive.
• Current methods can assist in the detection of close tertiary contacts.
• Depending on the application, different CM methods should be applied.
• A relation between sets of correlated paths has been suggested, but not always in a consistent and convincing way.
• A relation between CM and free energy has been suggested, but shown not to hold on a consistent basis.

Correlated mutations - future directions

• Improve MSA – filtering out sequences
• N-body correlations instead of pair-wise correlations
• Improved clustering techniques

ConSurf – mapping conservation scores on 3D structures

• ConSurf is a tool developed at TAU for mapping conservation scores on protein (and recently nucleic acid) structures.
• Detailed understanding of the mechanism of biological processes requires the identification of functionally important amino acids at the protein surface that are responsible for these interactions.
• The ConSurf server is a useful and user-friendly tool that enables the identification of functionally important regions on the surface of a protein of known three-dimensional structure, based on evolutionary analysis.

http://consurf.tau.ac.il/

Ashkenazy H., Erez E., Martz E., Pupko T. and Ben-Tal N. 2010. ConSurf 2010: calculating evolutionary conservation in sequence and structure of proteins and nucleic acids. Nucl. Acids Res (2010)

• Given the 3D-structure of a protein or a domain as an input, ConSurf extracts the sequence from the PDB.
• It then carries out a search for close homologous sequences of the protein of known structure using PSI-BLAST.
• Multiple sequence alignment is done using MUSCLE or CLUSTALW. The multiple sequence alignment is used to build a phylogenetic tree.
• Conservation scores are calculated based on a Bayesian or Maximum Likelihood method.
• The protein, with the conservation scores color-coded onto its surface, can finally be visualized on-line using Jmol.