Basics on Bioinformatics: Lecture 8
Nunzio D’Agostino

Protein domain: terminology
Superfamily: proteins that have low sequence identity, but whose structural and functional features suggest a common evolutionary origin.
Family: proteins clustered together into a family are clearly evolutionarily related (accepted rule: pairwise residue identity between the proteins > 30%).
Domain: a polypeptide chain, or part of a polypeptide chain, that can fold independently into a stable tertiary structure. Domains are also units of function. They are not unique to the protein products of one gene or one gene family, but appear in a variety of proteins.
Motif: a pattern of amino acids that is conserved across many proteins and confers a particular function on the protein.
Site (active site): the region of the protein where the substrate binds and catalysis occurs. The structure and chemical properties of the active site allow the recognition and binding of the substrate.

Why protein domain identification?
By identifying domains we can:
o classify a new protein as belonging to a specific family
o infer functionality
o infer the cellular localization of a protein

Domain representation - patterns
Some biologically significant amino acid patterns (motifs) can be summarised in the form of regular expressions. A regular expression is a powerful notational algebra that describes a string or a set of strings. It can be used whenever one wants to find patterns in strings.
The standard notation for describing such patterns uses these conventions:
[AS] = A or S allowed.
D = D allowed.
x = any symbol.
x4 = four arbitrary symbols.
{PG} = any symbol except P and G.
[FY]2 = two positions where F or Y is allowed.
x(3,7) = minimum 3 and maximum 7 arbitrary residues.

Domain representation - patterns
Workflow: build a multiple sequence alignment (MSA), detect the functionally important residues, then write the pattern as a regular expression:
[AVL]-L-[IV]-M-[TS]-C-[DE]-R-[FY]2-Q

Domain representation - PSSM/profile
A PSSM is a Position-Specific Scoring Matrix.
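As a rough illustration of what such a matrix can look like, the sketch below builds integer log-odds scores from a small ungapped alignment and uses them to score sequences. Everything specific in it is an assumption made for the example rather than something taken from the lecture: the toy alignment, the uniform background frequency of 1/20, the pseudocount of 0.1, the log2 scaling, and the function names `build_pssm` and `score_sequence`.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, pseudocount=0.1):
    """Return one {residue: integer log-odds score} dict per alignment column.

    Assumes an ungapped alignment and a uniform background (1/20 per residue).
    """
    background = 1.0 / len(AMINO_ACIDS)
    n_seqs = len(alignment)
    pssm = []
    for column in zip(*alignment):  # iterate over alignment columns
        counts = Counter(column)
        scores = {}
        for aa in AMINO_ACIDS:
            # Pseudocounts keep unseen residues from getting a frequency of 0.
            freq = (counts[aa] + pseudocount) / (
                n_seqs + pseudocount * len(AMINO_ACIDS)
            )
            scores[aa] = round(math.log2(freq / background))
        pssm.append(scores)
    return pssm


def score_sequence(pssm, sequence):
    """Sum the column scores for a sequence as long as the PSSM."""
    return sum(column[aa] for column, aa in zip(pssm, sequence))


# Hypothetical toy alignment of five columns.
alignment = ["LMTCD", "LMSCE", "IMTCD", "VMSCD"]
pssm = build_pssm(alignment)
```

With these settings, residues observed in a column receive positive scores and unobserved residues receive negative ones, so a sequence resembling the alignment scores higher than an unrelated one.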
A profile is one type of PSSM. A sequence profile (Gribskov et al. 1987) is essentially a table that lists the frequency of each amino acid at each position of a protein sequence. Frequencies are calculated from multiple alignments of related sequences (containing a domain of interest).
PSSM scores are generally shown as positive or negative integers. Positive scores indicate that the given amino acid substitution occurs more frequently in the alignment than expected by chance, while negative scores indicate that the substitution occurs less frequently than expected. Large positive scores often indicate critical functional residues, which may be active site residues or residues required for other intermolecular interactions.

Domain representation - Markov model
Markov model: a way of describing a process that goes through a series of states. In a regular Markov model, the state is directly visible to the observer. Each state has a probability of transitioning to each of the other states.
Xk is the random variable representing the state at step k. States are ∈ {A, C, G, T}.
State transition example (rows: state at step k; columns: state at step k+1):

        A     C     G     T
  A    0.3   0.2   0.1   0.4
  C    0.1   0.6   0.2   0.1
  G    0.2   0.4   0.1   0.3
  T    0.5   0.1   0.2   0.1

Domain representation - hidden Markov model
In a hidden Markov model (HMM), the state is not directly visible, but variables influenced by the state are visible. Each state has a probability distribution over the possible output tokens. Therefore the sequence of tokens generated by an HMM gives some information about the sequence of states.
An essential characteristic of a Markov process is that the change depends only on the current state (Bayes' theorem). The history of the system does not matter: the states the system has been in before are not relevant, and only the current state determines what will happen next. The system has no memory.
The hidden Markov model method was originally used in speech recognition before being applied to biological sequence analysis.
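The transition table from the Markov model slide can be used directly to compute how probable a nucleotide sequence is under a first-order Markov chain. The following is a minimal sketch under stated assumptions: rows are read as the state at step k and columns as the state at step k+1, the initial state distribution is taken to be uniform (the slides do not give one), and each row is rescaled to sum to 1 because the T row as printed sums to 0.9. The function names are hypothetical.

```python
import math

STATES = "ACGT"

# Rows: state at step k; columns: state at step k+1 (order A, C, G, T).
RAW_TRANSITIONS = {
    "A": [0.3, 0.2, 0.1, 0.4],
    "C": [0.1, 0.6, 0.2, 0.1],
    "G": [0.2, 0.4, 0.1, 0.3],
    "T": [0.5, 0.1, 0.2, 0.1],  # sums to 0.9 as printed on the slide
}


def normalise(rows):
    """Rescale each row so that its transition probabilities sum to 1."""
    table = {}
    for state, probs in rows.items():
        total = sum(probs)
        table[state] = {nxt: p / total for nxt, p in zip(STATES, probs)}
    return table


TRANSITIONS = normalise(RAW_TRANSITIONS)


def sequence_log_probability(sequence, initial=None):
    """Log-probability of a non-empty sequence under the Markov chain."""
    if initial is None:
        # Assumption: every nucleotide is equally likely as the first state.
        initial = {s: 1.0 / len(STATES) for s in STATES}
    log_p = math.log(initial[sequence[0]])
    # Each step depends only on the current state (no memory).
    for current, nxt in zip(sequence, sequence[1:]):
        log_p += math.log(TRANSITIONS[current][nxt])
    return log_p


probability = math.exp(sequence_log_probability("ACGT"))
```

Working in log space avoids numerical underflow when the same calculation is applied to long sequences.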
HMMs are used to represent sequence families. A particular type of HMM, the profile HMM, is suited to modeling multiple alignments.

[Figure: hidden states x1, x2, x3, x4 in a chain, each emitting an observed variable y1, y2, y3, y4]
Each oval shape represents a random variable that can adopt a number of values. The random variable x2 is the hidden state at time 2. The arrows in the diagram denote conditional dependencies: the value of the hidden variable x2 (at time 2) depends only on the value of the hidden variable x1 (at time 1). Similarly, the value of the observed variable y2 depends only on the value of the hidden variable x2 (both at time 2).
Hidden Markov model: each state x emits an output y with a specific probability. We only know the outputs (observations); thus, the states are hidden.

Basic architecture of a profile HMM
[Figure: Start and End states; delete states d1, d2, d3; insert states i0, i1, i2, i3; match states m1, m2, m3; example emission labels C, C, Y with values 0.01, 0.5, 0.01]
Match states: model the distribution of symbols in the corresponding column of an alignment.
Insert states: model insertions of random letters between two alignment positions.
Silent (delete) states: model deletions, which correspond to a gap in the alignment.
Transitions: states of neighboring positions are connected by transitions, which indicate the possibility of going from one state to the other.

Given a multiple sequence alignment of a particular domain family, one uses statistical methods to build a specific HMM for that domain family.
In contrast to patterns and profiles, HMMs allow consistent treatment of insertions and deletions.
In contrast to patterns and profiles, Markov models take into account the information about neighboring residues.

Protein domain db – PROSITE
PROSITE is a database of biologically significant patterns and profiles that help to reliably identify to which known protein family (if any) a new sequence belongs.
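Pattern matching of this kind can be sketched with ordinary regular expressions. The example below is a minimal illustration and not PROSITE's own scanning software: it translates a pattern written in PROSITE syntax into a Python regular expression and reports where it matches. The helper names `prosite_to_regex` and `find_pattern` and the test sequence are hypothetical, repeats are written in PROSITE's parenthesised form (`[FY](2)`, `x(4)`, `x(3,7)`), and only the elements listed in the pattern conventions earlier are supported.

```python
import re

# One pattern element: x, a single residue, [allowed] or {forbidden},
# optionally followed by a repeat count (n) or range (n,m).
ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.split("-"):
        match = ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = match.groups()
        if core == "x":
            core = "."  # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"  # any residue except those listed
        if low is None:
            repeat = ""
        elif high is None:
            repeat = "{" + low + "}"
        else:
            repeat = "{" + low + "," + high + "}"
        parts.append(core + repeat)
    return "".join(parts)


def find_pattern(pattern, sequence):
    """Return (1-based start, matched text) for each non-overlapping match."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start() + 1, m.group()) for m in regex.finditer(sequence)]


# Pattern from the earlier slide; the sequence is made up for this example.
hits = find_pattern("[AVL]-L-[IV]-M-[TS]-C-[DE]-R-[FY](2)-Q", "GGALIMTCDRFYQKK")
```

A pattern either matches or it does not, so a single substitution at a conserved position is enough to miss a true family member.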
There are a number of protein families, as well as functional or structural domains, that cannot be detected using patterns because of their extreme sequence divergence; techniques based on weight matrices (also known as profiles) allow the detection of such domains.
http://prosite.expasy.org
Example: ABL1_HUMN

Protein domain db – PRINTS
The PRINTS database houses a collection of protein family fingerprints.
Fingerprints: a recognized and powerful method of classifying new protein families is to use conserved regions within multiple alignments of related proteins. Each homologous region is a "motif", and sets of motifs provide a signature, or fingerprint, for unique identification. These motifs usually denote a common structure and/or function between individual family members.
http://www.bioinf.manchester.ac.uk/dbbrowser/PRINTS/index.php
Example: ABL1_HUMN

Protein domain db – Pfam
Pfam is a database of protein domain families. It contains curated multiple sequence alignments for each family and the corresponding profile hidden Markov models (HMMs). The HMMs are built with HMMER, a hidden Markov model software package that is also available as a stand-alone program.
This database is made up of two parts:
o Pfam-A: curated multiple alignments
   o grows slowly
   o quality controlled by experts
o Pfam-B: automatically generated (ProDom derived)
   o complements Pfam-A
   o new sequences instantly incorporated
   o unchecked: false positives, ...
http://pfam.sanger.ac.uk
Example: ABL1_HUMN

Protein domain db – ProDom
ProDom families are built by an automated process based on a recursive use of PSI-BLAST (Position-Specific Iterated BLAST) homology searches.

PSI-BLAST
"PSI-BLAST is designed for more sensitive protein-protein similarity searches."
"Position-Specific Iterated (PSI)-BLAST is the most sensitive BLAST program, making it useful for finding very distantly related proteins or new members of a protein family."
"Use PSI-BLAST when your standard protein-protein BLAST search either failed to find significant hits, or returned hits with descriptions such as 'hypothetical protein' or 'similar to...'."

Protein domain db – SMART
SMART (Simple Modular Architecture Research Tool) domains are extensively annotated with respect to phyletic distributions, functional class, tertiary structures and functionally important residues. SMART alignments are optimised manually, followed by construction of the corresponding hidden Markov models (HMMs).
http://smart.embl-heidelberg.de
Example: ABL1_HUMN

Protein domain db – InterPro
The resources examined so far have different areas of optimum application, owing to the different strengths and weaknesses of their underlying analysis methods. Thus, for best results, search strategies should ideally combine all of them.
InterPro (The InterPro Consortium 2001) is a collaborative project aimed at providing an integrated layer on top of the most commonly used signature databases by creating a unique, non-redundant characterisation of a given protein family, domain or functional site.
The InterPro project home page is available at: http://www.ebi.ac.uk/interpro

Entry types in InterPro
o Family: group of evolutionarily related proteins that share one or more domains/repeats in common.
o Domain: independent structural unit which can be found alone or in conjunction with other domains or repeats.
o Repeat: region occurring more than once that is not expected to fold into a globular domain on its own.
o PTM (post-translational modification): the sequence motif is defined by the molecular recognition of this region in a cell.
o Active site: catalytic pockets of enzymes where the catalytic residues are known.
o Binding site: binds compounds but is not necessarily involved in catalysis.
InterProScan
InterProScan is a tool that combines the different protein signature recognition methods native to the InterPro member databases into one resource, with look-up of the corresponding InterPro and GO annotation.