Basics on bioinformatics Lecture 8
Nunzio D’Agostino [email protected]; [email protected]
Protein domain: terminology
Superfamily: Proteins that have low sequence identity, but whose structural and functional features suggest a common evolutionary origin.
Family: Proteins clustered together into families are clearly evolutionarily related (accepted rule: pairwise residue identity between the proteins >30%).
Domain: A domain is defined as a polypeptide chain, or a part of a polypeptide chain, that can fold independently into a stable tertiary structure. Domains are also units of function; they are not unique to the protein products of one gene or one gene family but instead appear in a variety of proteins.
Motif: A pattern of amino acids that is conserved across many proteins and confers a particular function to the protein.
Site: The active site is the binding site where catalysis occurs. Its structure and chemical properties allow the recognition and binding of the substrate.
Why protein domain identification?
By identifying domains we can:
o classify a new protein as belonging to a specific family
o infer functionality
o infer cellular localization of a protein
Domain representation-patterns
Some biologically significant amino acid patterns (motifs) can be summarised in the form of regular expressions.
A regular expression is a powerful notational algebra that describes a string or a set of strings. It can be used whenever one wants to find patterns in strings.
The standard notations for describing regular expressions use these conventions:
[AS] = A or S allowed.
D = D allowed.
x = Any symbol.
x4 = Four arbitrary symbols.
{PG} = Any symbol except P and G.
[FY]2 = Two positions where F or Y is allowed.
x(3,7) = Minimum 3 and maximum 7 residues.
Domain representation-patterns
MSA
Detect functionally important residues
Pattern as regular expression: [AVL]-L-[IV]-M-[TS]-C-[DE]-R-[FY]2-Q
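As an illustration, the pattern notation listed earlier can be translated mechanically into a Python regular expression. This is a minimal sketch: the function name `slide_pattern_to_regex` and the test sequence are hypothetical, and the parser only covers the conventions listed on the slide.

```python
import re

def slide_pattern_to_regex(pattern: str) -> str:
    """Translate the slide's pattern notation into a Python regular expression."""
    parts = []
    for element in pattern.replace(" ", "").split("-"):
        # Split each element into a symbol part and an optional repeat part,
        # e.g. "[FY]2" -> ("[FY]", "2") and "x(3,7)" -> ("x", "(3,7)").
        match = re.fullmatch(
            r"(\[[A-Z]+\]|\{[A-Z]+\}|[A-Zx])(\d+|\(\d+(?:,\d+)?\))?", element
        )
        if match is None:
            raise ValueError(f"cannot parse element: {element!r}")
        symbol, repeat = match.groups()
        if symbol == "x":
            symbol = "."                         # any residue
        elif symbol.startswith("{"):
            symbol = "[^" + symbol[1:-1] + "]"   # any residue except those listed
        if repeat:
            repeat = "{" + repeat.strip("()") + "}"
        parts.append(symbol + (repeat or ""))
    return "".join(parts)

regex = slide_pattern_to_regex("[AVL]-L-[IV]-M-[TS]-C-[DE]-R-[FY]2-Q")
hit = re.search(regex, "MKALVMSCDRFYQGG")  # hypothetical protein fragment
print(regex, hit.group() if hit else None)
```

The same function handles the other conventions, for example `x(3,7)-{PG}-x4`.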
Domain representation-PSSM/profile
A PSSM is a Position Specific Scoring Matrix. A profile is one type of PSSM.
A sequence profile (Gribskov et al. 1987) is essentially a table that lists the frequencies of each amino acid at each position of a protein sequence. Frequencies are calculated from multiple alignments of related sequences (containing a domain of interest).
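The frequency table can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the four-sequence toy alignment, the uniform background of 1/20, the small pseudocount, and the function names. The log-odds conversion is a generic choice, not the exact scoring scheme of Gribskov et al.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def column_frequencies(alignment):
    """Return one {residue: frequency} dict per alignment column."""
    n_seqs = len(alignment)
    profile = []
    for column in zip(*alignment):
        counts = Counter(column)
        profile.append({aa: counts[aa] / n_seqs for aa in AMINO_ACIDS})
    return profile

def log_odds_scores(profile, background=1 / 20, pseudo=0.01):
    """Convert frequencies to rounded log2-odds scores (uniform background assumed)."""
    return [
        {aa: round(math.log2((freq + pseudo) / background)) for aa, freq in column.items()}
        for column in profile
    ]

toy_alignment = ["ALVMS", "VLIMT", "LLVMS", "ALIMT"]  # hypothetical aligned sequences
profile = column_frequencies(toy_alignment)
scores = log_odds_scores(profile)
print(profile[1]["L"], scores[1]["L"], scores[1]["P"])
```

In the second column every sequence carries L, so L gets a positive score and residues never observed there get a negative one.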
PSSM scores are generally shown as positive or negative integers. Positive scores indicate that the given amino acid substitution occurs more frequently in the alignment than expected by chance, while negative scores indicate that the substitution occurs less frequently than expected. Large positive scores often indicate critical functional residues, which may be active site residues or residues required for other intermolecular interactions.
Domain representation-Markov model
Markov model: a way of describing a process that goes through a series of states.
In a regular Markov model, the state is directly visible to the observer. Each state has a probability of transitioning to the other states.
X_K is the random variable describing the state at the K-th step.
States are ∈ {A,C,G,T}
State transition example (rows: state at the K-th step; columns: state at the (K+1)-th step):

       A     C     G     T
A    0.3   0.2   0.1   0.4
C    0.1   0.6   0.2   0.1
G    0.2   0.4   0.1   0.3
T    0.5   0.1   0.2   0.1
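As a small worked example, the transition table above can be used to compute the probability of a particular chain of states. The values are copied as printed on the slide (the T row sums to 0.9 there), so this sketch only multiplies table entries and does not rely on rows summing to 1. The function name `path_probability` is hypothetical.

```python
# Transition probabilities: TRANSITIONS[current][next], copied from the slide.
TRANSITIONS = {
    "A": {"A": 0.3, "C": 0.2, "G": 0.1, "T": 0.4},
    "C": {"A": 0.1, "C": 0.6, "G": 0.2, "T": 0.1},
    "G": {"A": 0.2, "C": 0.4, "G": 0.1, "T": 0.3},
    "T": {"A": 0.5, "C": 0.1, "G": 0.2, "T": 0.1},
}

def path_probability(sequence, transitions=TRANSITIONS):
    """Probability of the transitions in `sequence`, given its first state."""
    probability = 1.0
    for current, following in zip(sequence, sequence[1:]):
        # Each step depends only on the current state (Markov property).
        probability *= transitions[current][following]
    return probability

# A -> C -> G -> T: 0.2 * 0.2 * 0.3
print(path_probability("ACGT"))
```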
Domain representation-hidden Markov model
In a hidden Markov model, the state is not directly visible, but variables influenced by the state are visible. Each state has a probability distribution over the possible output tokens. Therefore the sequence of tokens generated by an HMM gives some information about the sequence of states.
An essential characteristic of a Markov process is that the change is dependent only on the current state. (Bayes' theorem) The history of the system does not matter.
The states that the system has been in before are not relevant, only the current state determines what will happen next. The system has no memory.
The Hidden Markov model method was originally used in speech recognition before being applied to biological sequence analysis.
HMMs are used to represent sequence families. A particular type of HMM, the profile HMM, is suited to modeling multiple alignments.
Domain representation-hidden Markov model
Each oval shape represents a random variable that can adopt a number of values. The random variable x2 is the hidden state at time 2. From the diagram, it is clear that the value of the hidden variable x2 (at time 2) only depends on the value of the hidden variable x1 (at time 1). The arrows in the diagram denote conditional dependencies. Similarly, the value of the observed variable y2 only depends on the value of the hidden variable x2 (both at time 2).
Hidden Markov model: Each state x emits an output y, at a specific probability. We only know the output (observations). Thus, the states are hidden.
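Because only the outputs are observed, the probability of an observed sequence has to be summed over all possible hidden state paths. The forward algorithm does this efficiently. The sketch below uses a hypothetical two-state HMM over {A,C,G,T}; the state names and all probabilities are invented for illustration and do not come from the lecture.

```python
STATES = ("S1", "S2")
START = {"S1": 0.5, "S2": 0.5}
TRANSITION = {
    "S1": {"S1": 0.9, "S2": 0.1},
    "S2": {"S1": 0.2, "S2": 0.8},
}
EMISSION = {
    "S1": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
    "S2": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
}

def forward_probability(observations):
    """Total probability of the observed sequence, summed over all hidden paths."""
    # forward[s] = P(observations so far, current hidden state = s)
    forward = {s: START[s] * EMISSION[s][observations[0]] for s in STATES}
    for symbol in observations[1:]:
        forward = {
            s: EMISSION[s][symbol]
            * sum(forward[prev] * TRANSITION[prev][s] for prev in STATES)
            for s in STATES
        }
    return sum(forward.values())

print(forward_probability("ACGT"))
```

A quick sanity check on such a model is that the probabilities of all observation sequences of a fixed length add up to 1.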
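[Figure: hidden states X1, X2, X3, X4 in a chain, each emitting an observed output Y1, Y2, Y3, Y4]
Basic Architecture of a profile HMM
[Figure: profile HMM running from Start to End through match states m1-m3, insert states i0-i3 and delete states d1-d3; labels shown: C, C, Y and 0.01, 0.5, 0.01]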
Insert states: Model insertions of random letters between two alignment positions
Silent states: Model deletions, which correspond to gaps in the alignment
Transitions: States of neighboring positions are connected by transitions that indicate the possibility of going from one state to the other
Match states: Model the distribution of symbols in the corresponding column of an alignment
Domain representation-hidden Markov model
Given a multiple sequence alignment of a particular domain family, one uses statistical methods to build a specific HMM for that domain family.
In contrast to patterns and profiles, HMMs allow consistent treatment of insertions and deletions.
In contrast to patterns and profiles, Markov models take into account the information about neighboring residues.
Protein domain db – PROSITE
PROSITE is a database of biologically significant patterns and profiles that help to reliably identify to which known protein family (if any) a new sequence belongs. There are a number of protein families as well as functional or structural domains that cannot be detected using patterns due to their extreme sequence divergence; the use of techniques based on weight matrices (also known as profiles) allows the detection of such domains.
http://prosite.expasy.org
[Figure: ABL1_HUMN]
Protein domain db – PRINTS
The PRINTS database houses a collection of protein family fingerprints.
Fingerprints = A recognized and powerful method of classifying new protein families is to use conserved regions within multiple alignments of related proteins. Each homologous region is a "motif", and sets of motifs provide a signature or fingerprint for unique identification. These motifs usually denote a common structure and/or function between individual family members.
http://www.bioinf.manchester.ac.uk/dbbrowser/PRINTS/index.php
[Figure: ABL1_HUMN]
Protein domain db – Pfam
Pfam is a database of protein domain families. It contains curated multiple sequence alignments for each family and corresponding profile hidden Markov models (HMMs).
The HMMs are built with HMMER, a hidden Markov model software package that is also available as a stand-alone program.
This database is made up of two parts:
Ø Pfam-A: curated multiple alignments
  Ø Grows slowly
  Ø Quality controlled by experts
Ø Pfam-B: automatically generated (ProDom derived)
  Ø Complements Pfam-A
  Ø New sequences instantly incorporated
  Ø Unchecked: false positives, ...
http://pfam.sanger.ac.uk
[Figure: ABL1_HUMN]
Protein domain db – ProDom
ProDom families are built by an automated process based on a recursive use of PSI-BLAST (Position Specific Iterated BLAST) homology searches.
PSI-BLAST
PSI-BLAST is designed for more sensitive protein-protein similarity searches. Position-Specific Iterated (PSI)-BLAST is the most sensitive BLAST program, making it useful for finding very distantly related proteins or new members of a protein family. Use PSI-BLAST when your standard protein-protein BLAST search either failed to find significant hits, or returned hits with descriptions such as "hypothetical protein" or "similar to...".
Protein domain db – SMART
SMART (Simple Modular Architecture Research Tool) domains are extensively annotated with respect to phyletic distributions, functional class, tertiary structures and functionally important residues. SMART alignments are optimised manually, followed by construction of the corresponding hidden Markov models (HMMs).
http://smart.embl-heidelberg.de
[Figure: ABL1_HUMN]
Protein domain db – InterPro
The resources examined until now have different areas of optimum application owing to the different strengths and weaknesses of their underlying analysis methods.
Thus, for best results, search strategies should ideally combine all of them.
InterPro (The InterPro Consortium 2001) is a collaborative project aimed at providing an integrated layer on top of the most commonly used signature databases by creating a unique, non-redundant characterisation of a given protein family, domain or functional site.
The InterPro project home page is available at:
http://www.ebi.ac.uk/interpro
Entry types in InterPro
o Family: group of evolutionarily related proteins that share one or more domains/repeats in common.
o Domain: independent structural unit which can be found alone or in conjunction with other domains or repeats.
o Repeat: region occurring more than once that is not expected to fold into a globular domain on its own.
o PTM (post-translational modification): the sequence motif is defined by the molecular recognition of this region in a cell.
o Active site: catalytic pockets of enzymes where the catalytic residues are known.
o Binding site: binds compounds but is not necessarily involved in catalysis.
InterProScan
InterProScan is a tool that combines different protein signature recognition methods native to the InterPro member databases into one resource with look up of corresponding InterPro and GO annotation.
Protein signature recognition methods:
-- Sequence-motif methods
PROSITE, home of regular expressions and profiles;
Gene3D, PANTHER, PIRSF, Pfam, SMART, SUPERFAMILY and TIGRFAMs, keepers of HMMs;
PRINTS, provider of fingerprints
-- Sequence-cluster methods
ProDom uses PSI-BLAST to find homologous sequences, which are clustered in the same ProDom entry
InterProScan: creation of InterPro
[Figure: InterProScan workflow. Protein sequences are analysed with ScanRegExp (PROSITE patterns), ProfileScan (PROSITE profiles), FingerPrintScan (PRINTS), HMMPfam (Pfam, SMART, TIGRFAMs), BlastProDom (ProDom) and HMMPanther (PANTHER); InterProScan combines the results, looks up InterPro and returns a list of InterPro hits.]
Application of InterPro
Diagnostic protein family signature database, useful for:
o The member databases themselves
o Enhancing the functional annotation of TrEMBL entries
o Classification of proteins through text and sequence search tools
o Large-scale classification using GO terms
o Enhancing genome annotation
o Proteome Analysis Database
Protein domain: summary
o Domains are the functional units of proteins
o Identifying a domain within a new protein may teach us much about it
o There are several types of models to represent domains
o These models can also be used to identify the domain they represent
o Many Internet databases are available to catalogue and identify families
HMMER
HMMER is used for searching sequence databases for homologs of protein sequences, and for making protein sequence alignments. It implements methods using probabilistic models called profile hidden Markov models (profile HMMs).
Compared to BLAST, FASTA, and other sequence alignment and database search tools based on older scoring methodology, HMMER aims to be significantly more accurate and more able to detect remote homologs because of the strength of its underlying mathematical models. In the past, this strength came at significant computational expense, but in the new HMMER3 project, HMMER is now essentially as fast as BLAST.
HMMER
o hmmalign - align sequences to an HMM profile
o hmmbuild - build a profile HMM from an alignment
o hmmcalibrate - calibrate HMM search statistics
o hmmconvert - convert between profile HMM file formats
o hmmemit - generate sequences from a profile HMM
o hmmfetch - retrieve an HMM from an HMM database
o hmmindex - create a binary SSI index for an HMM database
o hmmpfam - search one or more sequences against an HMM database
o hmmsearch - search a sequence database with a profile HMM
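A typical workflow chains two of the programs listed above: hmmbuild to turn an alignment into a profile HMM, then hmmsearch to scan a sequence database with it. The sketch below only assembles and prints the command lines, so it runs without HMMER installed. The file names and the helper `hmmer_commands` are hypothetical; the argument order follows the usual `hmmbuild <hmmfile_out> <msafile>` and `hmmsearch <hmmfile> <seqdb>` usage.

```python
import shlex

def hmmer_commands(alignment_file, hmm_file, sequence_db):
    """Return argument lists for building a profile HMM and searching with it."""
    return [
        ["hmmbuild", hmm_file, alignment_file],  # hmmbuild <hmmfile_out> <msafile>
        ["hmmsearch", hmm_file, sequence_db],    # hmmsearch <hmmfile> <seqdb>
    ]

# Hypothetical file names, for illustration only.
commands = hmmer_commands("family.sto", "family.hmm", "proteins.fasta")
for command in commands:
    print(shlex.join(command))
```

With HMMER installed, each list could be passed to `subprocess.run(command, check=True)` instead of being printed.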